Comparing Standard Deviation in a loop function with R - r

I have a data set with 399 rows and 7 columns. Each row contains some values and some NAs. I want to create a new data frame that holds, for each row, the standard deviations of all possible combinations of 3 elements of that row. For example, if row one has 4 elements, then row one of the new data frame should have 4 columns, one for the standard deviation of each combination of 3 elements of row 1 (of the original data set).
This is the head of the original Data Set:
V1 V2 V3 V4 V5 V6 V7
1 0.0853146 0.0809561 0.1350686 NA NA NA NA
2 0.0788104 0.0964276 0.1222457 0.0853146 NA NA NA
3 0.1086917 0.0818920 0.0479148 0.0981603 0.0788104 NA NA
4 0.0811772 0.1088340 0.1823510 0.0809561 0.0964276 0.1086917 NA
5 0.1015970 0.1089944 0.1243186 0.0858065 0.0842896 0.0818920 0.0811772
6 0.0639869 0.1496792 0.1704337 0.1088340 0.1015970 NA NA
7 0.0619823 0.0962283 0.1089944 0.0639869 NA NA NA
The problem is that I can't remove the NAs, so I get the wrong number of combinations and therefore the wrong number of standard deviations.
Here is what I came up with, but it does not work.
mydf<-as.matrix(df, na.rm=TRUE)
row<-apply(mydf, na.rm=TRUE, MARGIN = 1, FUN =combn, m=3, simplify = TRUE)
row<-as.matrix((row))
stdeviation<-apply(row,MARGIN = 1, FUN=sd,na.rm=TRUE)
stdeviation<-as.data.frame(stdeviation)
The table of the combinations looks like this for row 2:
V1 V2 V3
0.0788104313282292 0.0964276223058486 0.122245745410429
0.0788104313282292 0.0964276223058486 0.0853146853146852
0.0788104313282292 0.122245745410429 0.0853146853146852
0.0964276223058486 0.122245745410429 0.0853146853146852
The output for the second row, which I managed to produce, looks like this:
V1 V2 V3 V4
stdeviation 0.02184631 0.008908499 0.02342661 0.01894719
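One possible way to get there is to drop the NAs row by row before calling combn, so that each row only yields combinations of its actual values. The sketch below is a minimal illustration, not a tested solution for the full 399-row data set; combo_sds is a hypothetical helper name, and the example data frame contains only the first three rows shown above (truncated to five columns).

```r
# Example data: the first three rows from the question (values as printed)
df <- data.frame(
  V1 = c(0.0853146, 0.0788104, 0.1086917),
  V2 = c(0.0809561, 0.0964276, 0.0818920),
  V3 = c(0.1350686, 0.1222457, 0.0479148),
  V4 = c(NA, 0.0853146, 0.0981603),
  V5 = c(NA, NA, 0.0788104)
)

# Hypothetical helper: standard deviations of all size-m combinations
# of the non-NA values in one row
combo_sds <- function(x, m = 3) {
  x <- x[!is.na(x)]                      # remove NAs before combining
  if (length(x) < m) return(numeric(0))  # too few values for a combination
  # combine over indices so a length-1 x is never misread as seq_len(x)
  combn(length(x), m, FUN = function(idx) sd(x[idx]))
}

# One vector of standard deviations per row
sd_list <- lapply(seq_len(nrow(df)), function(i) combo_sds(unlist(df[i, ])))

# Pad every row to the same width with NA and bind into a data frame
width <- max(lengths(sd_list))
stdeviation <- as.data.frame(do.call(rbind, lapply(sd_list, function(v) {
  length(v) <- width
  v
})))
```

Rows with fewer values produce fewer standard deviations, so the result is padded with NA on the right.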

Related

How to read data with many blank fields in R

I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
id V1 V2 V3 V4 V5
1 1 A 1 NA NA NA
2 2 B 2 NA NA NA
But I expect this:
df
id V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.
I can't get it to work with read.table, so how about parsing the string manually:
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
    str_remove("\\n") %>%
    matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
    as.data.frame() %>%
    slice(-1) %>%
    mutate_if(is.factor, as.character) %>%
    mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: We split the string on "\t"; the first occurrence of "\n" tells us how many columns we have. We then tidy up the entries by removing the line break characters "\n", reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess guess the data type of every column.
For good measure we can roll everything into a function
read.my.data <- function(s) {
    entries <- unlist(str_split(s, "\t"))
    ncol <- str_which(entries, "\n")[1]
    entries %>%
        str_remove("\\n") %>%
        matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
        as.data.frame() %>%
        slice(-1) %>%
        mutate_if(is.factor, as.character) %>%
        mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
data.table's fread() had no problem reading in the string, but your data seems to have one \t too many (after each \n), which causes the creation of an extra column.
It is probably best practice to fix this in the export that creates your files.
If this is not possible, you can adjust fread()'s arguments to get the desired output.
Here we use drop to delete the first column that was created due to the extra \t.
To get the right column names back, we read the first line of the file again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread(string,
                  drop = 1,
                  fill = TRUE,
                  # read the header line separately to recover the column names
                  col.names = as.matrix(data.table::fread(string, nrows = 1, header = FALSE))[1, ])
ID V1 V2 V3 V4 V5
1: 1 A NA NA NA 1
2: 2 B NA NA NA 2
As Quar already mentioned in his/her comment, your file has an extra tab at the beginning of every data line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID V1 V2 V3 V4 V5
1 A 1
2 B 2
That would be ok if the additional first column contained unique row names.
So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced that file) or 2. fix the row name issue.
Here is my suggestion using the second option:
As the data is tab separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
Hope that helps.
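The first option mentioned above (removing the empty column) can also be done in R when the export itself cannot be fixed. The following is a minimal sketch under the assumption that every data line starts with exactly one stray tab, as in the example string; it strips that tab and then reads the cleaned lines with read.delim.

```r
foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

# Split into lines and drop a single leading tab where present
# (the header line has none, so it is left unchanged)
lines <- strsplit(foo, "\n", fixed = TRUE)[[1]]
lines <- sub("^\t", "", lines)

# Header and data rows now have the same number of fields
df <- read.delim(text = lines, header = TRUE, na.strings = c("", "NA"))
df
```

For a file on disk, the lines could be obtained with readLines() instead of strsplit().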

Checking if a column name exists in another dataset

I have two different data sets and I am trying to check whether a column name in one also appears as a column name in the other. For example:
V1 V2 V3
1 2 3
as one data set and
V4 V6 V1 V2
NA NA NA NA
And I am trying to make it so the second data set is like this
V4 V6 V1 V2
NA NA 1 NA
where only the minimum value in the original data set is copied over, if that makes sense. I have tried using this condition:
if(ncol((Session1t[grep(temp1, names(Session1t))])) != 0)
But this is not working. It returns the same value regardless of the input. After entering the if statement I then copy over only the column that I want, and I have that part figured out; I just cannot get the if statement to work properly.
We can use ifelse and %in% to match column names and replace NA with 1.
# Create example data frame D1
D1 <- read.table(text = "V1 V2 V3
1 2 3",
header = TRUE)
# Create example data frame D2
D2 <- read.table(text = "V4 V6 V1 V2
NA NA NA NA",
header = TRUE)
# Replace NA with 1 if column names match
D2[1, ] <- ifelse(names(D2) %in% names(D1), 1, NA)
D2
# V4 V6 V1 V2
# 1 NA NA 1 1
Another option is intersect, which copies the values of all shared columns:
nm1 <- intersect(names(df1), names(df2))
df2[nm1] <- df1[nm1]
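Both answers above fill every shared column, whereas the question asks for only the minimum value to be copied. A minimal sketch of that variant follows, assuming "minimum" means the smallest value among the columns that exist in both data sets; min_col is a name introduced here for illustration.

```r
# Example data frames from the question
D1 <- data.frame(V1 = 1, V2 = 2, V3 = 3)
D2 <- data.frame(V4 = NA, V6 = NA, V1 = NA, V2 = NA)

# Column names present in both data frames
shared <- intersect(names(D1), names(D2))

if (length(shared) > 0) {
  # Assumption: copy only the shared column holding the smallest value in D1
  min_col <- shared[which.min(unlist(D1[1, shared]))]
  D2[1, min_col] <- D1[1, min_col]
}
D2
```

Checking length(shared) > 0 plays the role of the if statement in the question: it is FALSE when no column name is shared.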

sieve out non-NA entries from data frame while retaining rows with only NA

I am looking for a more efficient way (in terms of length of code) of converting a data.frame from:
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 1 2 3 NA NA NA NA NA NA
# 2 NA NA NA 3 2 1 NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA 1 2 3
to
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 3 2 1
#[3,] NA NA NA
#[4,] NA NA NA
#[5,] 1 2 3
That is, I want to remove excess NAs but correctly represent rows with only NAs.
I wrote the following function which does the job, but I am sure there is a less lengthy way of achieving the same.
#Dummy data.frame
data <- matrix(c(1:3, rep(NA, 6),
rep(NA, 3), 3:1, rep(NA, 3),
rep(NA, 9),
rep(NA, 9),
rep(NA, 6), 1:3),
byrow=TRUE, ncol=9)
data <- as.data.frame(data)
sieve <- function(data) {
    #get a list of all entries that are not NA
    cond <- apply(data, 1, function(x) x[!is.na(x)])
    #set integer(0) equal to NA
    cond[sapply(cond, function(x) length(x)==0)] <- NA
    #check how many items there are in non-empty rows
    #(rows are either empty or contain the same number of items)
    n <- max(sapply(cond, length))
    #replace single NA with n NAs, where n = number of items
    #first get an index of entries with single NAs
    index <- (1:length(cond)) [sapply(cond, function(x) length(x)==1)]
    #then replace each entry with n NAs
    for (i in index) cond[[i]] <- rep(NA, n)
    #turn list into a data.frame
    cond <- matrix(unlist(cond), nrow=length(cond), byrow=TRUE)
    cond
}
sieve(data)
My question resembles this question about extracting conditions to which participants are assigned (for which I received great answers). I tried expanding these answers to the current dummy data, but without success so far. Hence my rather lengthy custom function.
Edit: Additional info on why I am asking this question: The first data frame represents the raw output from an experiment in which I assigned participants to one of three conditions (using 3 here for simplicity). In each condition, participants read a different scenario, but then answered the same set of questions about the scenario they had read. Qualtrics recorded answers from participants in the first condition in columns V1 through V3, answers from participants in the second condition in columns V4 through V6, and answers from participants in the third condition in columns V7 through V9. (If this block had contained 4 questions, it would have been columns V1 through V4 for answers from participants in the first condition, V5 through V8 for answers from participants in the second condition ...).
You can try this if the number of non-NA values is always the same in rows that aren't entirely filled with NA (d here is the data frame that the question calls data):
First, create a data frame with the appropriate (transposed) dimensions, filled with NAs.
d2 <- data.frame(
matrix(nrow = max(apply(d, 1, function(ii) sum(!is.na(ii)))),
ncol=nrow(d)))
Then use apply to fill that data frame, and transpose it to get your desired outcome:
d2[] <- apply(d, 1, function(ii) ii[!is.na(ii)])
t(d2)
# [,1] [,2] [,3]
#X1 1 2 3
#X2 3 2 1
#X3 NA NA NA
#X4 NA NA NA
#X5 1 2 3
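The same idea can also be written as an explicit loop that copies each row's non-NA values into a pre-filled NA matrix. This is only a sketch, with sieve2 as a hypothetical name; it assumes, as above, that the non-NA values of each row should be left-aligned.

```r
# Dummy data from the question
data <- as.data.frame(matrix(c(1:3, rep(NA, 6),
                               rep(NA, 3), 3:1, rep(NA, 3),
                               rep(NA, 9),
                               rep(NA, 9),
                               rep(NA, 6), 1:3),
                             byrow = TRUE, ncol = 9))

# Hypothetical helper: left-align the non-NA entries of every row
sieve2 <- function(data) {
  m <- as.matrix(data)
  n <- max(rowSums(!is.na(m)))               # widest row of real values
  out <- matrix(NA, nrow = nrow(m), ncol = n)
  for (i in seq_len(nrow(m))) {
    v <- m[i, !is.na(m[i, ])]                # non-NA entries of row i
    out[i, seq_along(v)] <- v                # all-NA rows stay NA
  }
  out
}

result <- sieve2(data)
result
```

Because the output matrix starts out as all NA, rows without any values need no special handling.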

Add dataframes to each other by row retaining all columns in R

I have 3 dataframes that I would like to bind together by row but also retain the columns that each one has such that columns not present in one dataframe are just initialized to NA and added to the resultant dataframe. Since I may have many more columns than the ones provided in the example below, I can't hardcode them as I have been doing so far.
a=data.frame(v1=rnorm(10),v2=rnorm(10),v3=rnorm(10))
b=data.frame(v1=rnorm(10),v3=rnorm(10),v4=rnorm(10))
c=data.frame(v2=rnorm(10),v5=rnorm(10),v6=rnorm(10))
Desired output:
Dimensions of 30 by 6 with an output header of
v1 v2 v3 v4 v5 v6
0.0.. 0.0.. 0.0.. NA NA NA
0.0.. NA 0.0.. 0.0.. NA NA
NA 0.0.. NA NA 0.0.. 0.0..
etc.
How do I achieve this in a scalable and efficient way?
Try:
library(dplyr)
bind_rows(a, b, c)
From the documentation:
When row-binding, columns are matched by name, and any values that don't match will be filled with NA.
Using rbindlist from data.table is likely to be faster:
library(data.table)
result <- rbindlist(list(a,b,c), fill=TRUE)
result[c(1:2,11:12,21:22),]
# v1 v2 v3 v4 v5 v6
# 1: -0.7789103 0.9362939 -1.3353714 NA NA NA
# 2: 1.7435594 -1.0624084 1.2827752 NA NA NA
# 3: -0.8456543 NA 0.6196773 -1.6647646 NA NA
# 4: -1.2504797 NA -1.2812387 0.9288518 NA NA
# 5: NA 1.1489591 NA NA 1.3822840 -1.8260830
# 6: NA -0.8424763 NA NA 0.1684902 0.9952818
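If neither package is available, the same fill-with-NA behaviour can be sketched in base R by first adding the missing columns to each data frame and then calling rbind. The helper name bind_fill is hypothetical, and the small data frames below are made-up stand-ins for a, b and c.

```r
# Hypothetical helper: row-bind data frames, filling absent columns with NA
bind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- unique(unlist(lapply(dfs, names)))   # union of column names
  padded <- lapply(dfs, function(d) {
    for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA
    d[all_cols]                                    # same column order everywhere
  })
  do.call(rbind, padded)
}

# Small made-up inputs with the same column layout as in the question
a <- data.frame(v1 = 1:2, v2 = 3:4, v3 = 5:6)
b <- data.frame(v1 = 7:8, v3 = 9:10, v4 = 11:12)
c <- data.frame(v2 = 13:14, v5 = 15:16, v6 = 17:18)

result <- bind_fill(a, b, c)
dim(result)
```

This is unlikely to match the speed of bind_rows or rbindlist on large inputs, but it has no dependencies.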

What is the difference between cor and cor.test in R

I have a data frame whose columns are different samples of an experiment. I want to find the correlation between these samples, that is, the correlation between sample v2 and v3, between sample v2 and v4, and so on.
This is the data frame:
> head(t1)
V2 V3 V4 V5 V6
1 0.12725011 0.051021886 0.106049328 0.09378767 0.17799444
2 0.86096784 1.263327211 3.073650624 0.75607466 0.92244361
3 0.45791031 0.520207274 1.526476608 0.67499102 0.49817761
4 0.00000000 0.001139721 0.003158557 0.00000000 0.00000000
5 0.13383965 0.098943019 0.099922146 0.13871867 0.09750611
6 0.01016334 0.010187671 0.025410170 0.00000000 0.02369374
> nrow(t1)
[1] 23367
If I run the cor function on this data frame to get the correlation between samples (columns), I get NA for all pairs of samples:
> cor(t1, method= "spearman")
V2 V3 V4 V5 V6
V2 1 NA NA NA NA
V3 NA 1 NA NA NA
V4 NA NA 1 NA NA
V5 NA NA NA 1 NA
V6 NA NA NA NA 1
But if I run this:
> cor.test(t1[,1],t1[,2], method="spearman")$estimate
rho
0.92394
the result is different. Why is this so? What is the correct way of getting the correlation between these samples?
Thank you in advance.
Your data contains NA values.
From ?cor:
If use is "everything", NAs will propagate conceptually, i.e., a
resulting value will be NA whenever one of its contributing
observations is NA.
From ?cor.test
na.action a function which indicates what should happen when the data
contain NAs. Defaults to getOption("na.action").
On my system:
getOption("na.action")
[1] "na.omit"
Use which(!is.finite(as.matrix(t1))) to search for problematic values (is.finite does not accept a data frame directly) and which(is.na(t1)) to search for NA values. cor returns NaN if you have Inf values in your data.
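To get a full correlation matrix despite the missing values, the use argument of cor can be set so that incomplete observations are dropped. The sketch below uses a small made-up data frame (not the 23367-row t1 from the question) and shows use = "pairwise.complete.obs", which drops rows separately for each pair of columns; "complete.obs" would instead drop every row that contains any NA.

```r
# Made-up stand-in for t1 with a few missing values
t1 <- data.frame(
  V2 = c(0.1, 0.9, 0.5, NA, 0.2),
  V3 = c(0.05, 1.2, 0.5, 0.001, 0.1),
  V4 = c(0.1, 3.0, 1.5, 0.003, NA)
)

# How many NAs does each column contain?
na_counts <- colSums(is.na(t1))

# Default use = "everything": any pair involving an NA yields NA
rho_default <- cor(t1, method = "spearman")

# Drop incomplete observations pair by pair
rho_pairwise <- cor(t1, method = "spearman", use = "pairwise.complete.obs")
rho_pairwise
```

With pairwise deletion, different entries of the matrix can be based on different subsets of rows, which is worth keeping in mind when comparing them.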
