Add dataframes to each other by row, retaining all columns, in R

I have 3 dataframes that I would like to bind together by row while retaining all the columns each one has, such that columns not present in a given dataframe are initialized to NA in the resultant dataframe. Since I may have many more columns than the ones in the example below, I can't hardcode them as I have been doing so far.
a <- data.frame(v1 = rnorm(10), v2 = rnorm(10), v3 = rnorm(10))
b <- data.frame(v1 = rnorm(10), v3 = rnorm(10), v4 = rnorm(10))
c <- data.frame(v2 = rnorm(10), v5 = rnorm(10), v6 = rnorm(10))
Desired output:
Dimensions of 30 by 6 with an output header of
v1 v2 v3 v4 v5 v6
0.0.. 0.0.. 0.0.. NA NA NA
0.0.. NA 0.0.. 0.0.. NA NA
NA 0.0.. NA NA 0.0.. 0.0..
etc.
How do I achieve this in a scalable and efficient way?

Try:
library(dplyr)
bind_rows(a, b, c)
From the documentation:
When row-binding, columns are matched by name, and any values that don't match will be filled with NA.
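For the example data above, a quick check of the result (a small verification, not part of the original answer):
library(dplyr)
res <- bind_rows(a, b, c)
dim(res)    # 30  6
names(res)  # "v1" "v2" "v3" "v4" "v5" "v6" -- rows that came from `a` get NA in v4:v6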

This is likely to be faster.
library(data.table)
result <- rbindlist(list(a, b, c), fill = TRUE)
result[c(1:2,11:12,21:22),]
# v1 v2 v3 v4 v5 v6
# 1: -0.7789103 0.9362939 -1.3353714 NA NA NA
# 2: 1.7435594 -1.0624084 1.2827752 NA NA NA
# 3: -0.8456543 NA 0.6196773 -1.6647646 NA NA
# 4: -1.2504797 NA -1.2812387 0.9288518 NA NA
# 5: NA 1.1489591 NA NA 1.3822840 -1.8260830
# 6: NA -0.8424763 NA NA 0.1684902 0.9952818
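rbindlist() returns a data.table. If a plain data.frame is needed downstream, it can be converted in place (a usage note, not part of the original answer):
setDF(result)   # convert the data.table to a data.frame by reference
class(result)   # "data.frame"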

Pivot_wider in tidyr creates list cols even when there is no duplicate or missing data

This is my code:
# reading input file
library(readxl)
df_testing <- read_excel("Testing_Data.xlsx")
# Renaming the 1st column name for ease of use
colnames(df_testing)[1] = "Tag_No"
View(df_testing)
# creating a new data frame with columns from the row values
library(tidyr)
df_output = pivot_wider(df_testing, names_from = Tag_No, values_from = Reading)
# the below output is as expected, yet coming in list cols
View(df_output)
# this below code is an attempt to fix but replaces last row values with NA
# df_output = lapply(df_output, unlist)
# df_output = data.frame(lapply(df_output, `length<-`, max(lengths(df_output))))
# level count should be equal to no of columns created
length(levels(df_testing$Tag_No)) == ncol(df_output) - 3
# save output to the file. Since, it is in list cols, I can't save the data to the file
write.csv(df_output, file = "Output File.csv")
This is the input data
file link 1
This is the sample of expected output data
file link 2
Any changes to make the code work correctly without losing data, or a complete solution, are welcome. Thanks in advance. If I have misunderstood how pivot_wider is used, kindly give some tips.
The issue is caused by NA values: there are around 59 rows that are entirely NA. Because those rows all share the same (NA) values in the identifying columns, pivot_wider treats them as duplicates and packs the duplicated Reading values into list columns.
library(readxl)
library(tidyr)
library(dplyr)  # for filter() and %>%
df_testing <- read_excel("Testing_Data.xlsx")
df_testing %>% filter(is.na(`Tag No.`))
# A tibble: 59 x 4
# `Tag No.` Reading Date Time
# <chr> <dbl> <dttm> <dttm>
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 NA NA NA NA
# 5 NA NA NA NA
# 6 NA NA NA NA
# 7 NA NA NA NA
# 8 NA NA NA NA
# 9 NA NA NA NA
#10 NA NA NA NA
# … with 49 more rows
Dropping the NA rows doesn't give list columns.
df_output <- pivot_wider(na.omit(df_testing), names_from = `Tag No.`, values_from = Reading)
df_output
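With the all-NA rows dropped, every cell holds a single value, so the write.csv() call that failed in the question should now work (a sketch reusing the objects above):
# no list columns remain, so the result can be serialized directly
write.csv(df_output, file = "Output File.csv", row.names = FALSE)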

Comparing Standard Deviation in a loop function with R

I have a data set with 399 rows and 7 columns. Each row is made up of some NAs and some values. What I want to do is create a new data frame holding, for each row, the standard deviations of all the possible combinations of 3 elements of that row. Say row one has 4 elements; then I want the new data frame, on row one, to have 4 columns with the standard deviations of all the 3-element combinations of row 1 of the original data set.
This is the head of the original Data Set:
V1 V2 V3 V4 V5 V6 V7
1 0.0853146 0.0809561 0.1350686 NA NA NA NA
2 0.0788104 0.0964276 0.1222457 0.0853146 NA NA NA
3 0.1086917 0.0818920 0.0479148 0.0981603 0.0788104 NA NA
4 0.0811772 0.1088340 0.1823510 0.0809561 0.0964276 0.1086917 NA
5 0.1015970 0.1089944 0.1243186 0.0858065 0.0842896 0.0818920 0.0811772
6 0.0639869 0.1496792 0.1704337 0.1088340 0.1015970 NA NA
7 0.0619823 0.0962283 0.1089944 0.0639869 NA NA NA
The problem is that I can't remove the NAs, so I get the wrong number of combinations and therefore the wrong number of standard deviations.
Here is what I came up with, but it does not work:
mydf <- as.matrix(df, na.rm = TRUE)
row <- apply(mydf, na.rm = TRUE, MARGIN = 1, FUN = combn, m = 3, simplify = TRUE)
row <- as.matrix(row)
stdeviation <- apply(row, MARGIN = 1, FUN = sd, na.rm = TRUE)
stdeviation <- as.data.frame(stdeviation)
The table of the combinations looks like this for row 2:
V1 V2 V3
0.0788104313282292 0.0964276223058486 0.122245745410429
0.0788104313282292 0.0964276223058486 0.0853146853146852
0.0788104313282292 0.122245745410429 0.0853146853146852
0.0964276223058486 0.122245745410429 0.0853146853146852
The output for row 2, which I managed to produce, looks like:
V1 V2 V3 V4
stdeviation 0.02184631 0.008908499 0.02342661 0.01894719
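One way to get the right number of combinations is to drop the NAs inside a per-row function before combn() runs, instead of passing na.rm through apply(). A minimal sketch under that assumption (reusing df from the question; not a tested answer from the thread):
# for each row: drop the NAs, then take the SD of every 3-element combination
sds_by_row <- lapply(seq_len(nrow(df)), function(i) {
  r <- na.omit(unlist(df[i, ]))
  if (length(r) < 3) return(NA_real_)  # too few values to form a triple
  combn(r, 3, FUN = sd)
})
# rows yield different numbers of combinations, so pad with NA before binding
max_len <- max(lengths(sds_by_row))
result <- as.data.frame(do.call(rbind, lapply(sds_by_row, `length<-`, max_len)))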

how to make a text parsing function efficient in R

I have this function which calculates the consonanceScore of a book. First I import the phonetics dictionary from CMU (which forms a dataframe of about 134000 rows and 33 column variables; any row in the CMU dictionary is basically of the form CLOUDS K L AW1 D Z: the first column has the words, and the remaining columns have their phonetic equivalents). After getting the CMU dictionary, I parse a book into a vector containing all its words; the maximum length of any one book (so far) is 218711. Each word's phonetics are compared with the phonetics of the next word and of the word after that. The TRUE match values are then combined into a sum. The function I have is this:
getConsonanceScore <- function(book, consonanceScore, CMUdict) {
  for (i in 1:(length(book) - 2)) {
    index1 <- replaceIfEmpty(which(toupper(book[i]) == CMUdict[, 1]))
    index2 <- replaceIfEmpty(which(toupper(book[i + 1]) == CMUdict[, 1]))
    index3 <- replaceIfEmpty(which(toupper(book[i + 2]) == CMUdict[, 1]))
    word1 <- as.character(CMUdict[index1, which(CMUdict[index1, ] != "")])
    word2 <- as.character(CMUdict[index2, which(CMUdict[index2, ] != "")])
    word3 <- as.character(CMUdict[index3, which(CMUdict[index3, ] != "")])
    consonanceScore <- sum(word1 %in% word2)
    consonanceScore <- consonanceScore + sum(word1 %in% word3)
    consonanceScore <- consonanceScore / length(book)
  }
  return(consonanceScore)
}
A replaceIfEmpty function basically just returns the index for a dummy value (that has been declared in the last row of the dataframe) if there is no match found in the CMU dictionary for any word in the book. It goes like this:
replaceIfEmpty <- function(x) {
  if (length(x) > 0) {
    return(x)
  } else {
    x <- 133780
    return(x)
  }
}
The issue I am facing is that the getConsonanceScore function takes a lot of time. So much so that I had to divide the book length by 1000 in the loop just to check whether the function was working at all. I am new to R and would really be grateful for some help in making this function more efficient and less time-consuming; are there any ways of doing this? (I have to call this function later on possibly 50-100 books.) Thanks a lot!
I've recently re-read your question, the comments, and @wibeasley's answer, and realized I hadn't understood everything correctly. Now it has become clearer, and I'll try to suggest something useful.
First of all, we need a small example to work with. I've made it from the dictionary in your link.
dictdf <- read.table(text =
"A AH0
CALLED K AO1 L D
DOG D AO1 G
DOGMA D AA1 G M AH0
HAVE HH AE1 V
I AY1",
header = F, col.names = paste0("V", 1:25), fill = T, stringsAsFactors = F )
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
bookdf <- data.frame(words = c("I", "have", "a", "dog", "called", "Dogma"))
# words
# 1 I
# 2 have
# 3 a
# 4 dog
# 5 called
# 6 Dogma
Here we read the data from the dictionary with fill = T and manually define the number of columns in the data.frame by setting col.names. You may use 50, 100, or some other number of columns (but I don't think the dictionary has words that long). And we make bookdf, a vector of words in the form of a data.frame.
Then let's merge the book and the dictionary together. I use the dplyr library mentioned by @wibeasley.
# for big data frames dplyr does merging fast
require("dplyr")
# make all letters uppercase
bookdf[,1] <- toupper(bookdf[,1])
# merge
bookphon <- left_join(bookdf, dictdf, by = c("words" = "V1"))
# words V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
And after that we scan row-wise for matching sounds in consecutive words. I arranged it with the help of sapply.
consonanceScore <-
  sapply(1:(nrow(bookphon) - 2),
         function(i_row) {
           word1 <- bookphon[i_row, ][, -1]
           word2 <- bookphon[i_row + 1, ][, -1]
           word3 <- bookphon[i_row + 2, ][, -1]
           word1 <- unlist(word1[which(!is.na(word1) & word1 != "")])
           word2 <- unlist(word2[which(!is.na(word2) & word2 != "")])
           word3 <- unlist(word3[which(!is.na(word3) & word3 != "")])
           sum(word1 %in% word2) + sum(word1 %in% word3)
         })
[1] 0 0 0 4
There are no shared phonemes in the first three rows, but the 4th word, 'dog', has 2 matching sounds with 'called' (D and O/A) and 2 matches with 'dogma' (D and G). The result is a numeric vector; you can sum() it, divide it by nrow(bookdf), or whatever you need.
Are you sure it's working correctly? Isn't the function returning consonanceScore just for the last three words of the book? If the loop's third-to-last line is
consonanceScore <- sum(word1 %in% word2)
how is its value being recorded, or how does it influence later iterations of the loop?
There are several vectorization approaches that will increase your speed, but for something tricky like this, I like making sure the slow, loopy way is working correctly first. While you're in that stage of development, here are some suggestions for making the code quicker and/or neater (which hopefully helps you debug with more clarity).
Short-term suggestions
Inside replaceIfEmpty(), use ifelse(). Maybe even use ifelse() directly inside the main function.
Why is as.character() necessary? That casting can be expensive. Are those columns factors? If so, pass stringsAsFactors = FALSE when you use something like read.csv().
Don't use toupper() three times for each iteration. Just convert the whole thing once before the loop starts.
Similarly, don't execute / length(book) in every iteration. Since it's the same denominator for the whole book, divide the final vector of numerators only once, after the loop is done. (A sketch of these last two fixes follows this list.)
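Here is a minimal sketch of the last two items (reusing the question's variable names; it assumes the dictionary columns are character, per the stringsAsFactors suggestion above):
bookUpper <- toupper(book)                # uppercase the whole book once
numerators <- numeric(length(book) - 2)   # one numerator per position
for (i in seq_len(length(book) - 2)) {
  index1 <- replaceIfEmpty(which(bookUpper[i] == CMUdict[, 1]))
  index2 <- replaceIfEmpty(which(bookUpper[i + 1] == CMUdict[, 1]))
  index3 <- replaceIfEmpty(which(bookUpper[i + 2] == CMUdict[, 1]))
  word1 <- unlist(CMUdict[index1, -1], use.names = FALSE)
  word2 <- unlist(CMUdict[index2, -1], use.names = FALSE)
  word3 <- unlist(CMUdict[index3, -1], use.names = FALSE)
  word1 <- word1[!is.na(word1) & word1 != ""]  # keep only real phonemes
  numerators[i] <- sum(word1 %in% word2) + sum(word1 %in% word3)
}
consonanceScore <- sum(numerators) / length(book)  # divide once, after the loop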
Long-term suggestions
Eventually I think you'll want to look up each word only once, instead of three times. Those lookups are expensive. Similar to @inscaven's suggestion, I think an intermediate table makes sense (where each row is one of the book's words).
To produce the intermediate table, you should get much better performance from a join function written and optimized by someone else in C/C++. Consider something like dplyr::left_join(). Maybe book has to be converted to a single-variable data.frame first. Then left join it to the first column of the dictionary. The row's subsequent columns will essentially be appended to the right side of book (which I think is what's happening now).
Once each iteration is quicker and correct, consider using one of the xapply functions, or something in dplyr. The advantage of these functions is that memory for the entire vector isn't destroyed and reallocated for every single word in each book.

how to extract numbers that are in a certain position in a character vector within a data frame

I have a csv file looking like this:
data[1,]"0;0;0;0";"0;0;0;0";"1395,387994;0;0;0";"1438,433382;0;0;0";"1477,891654;0;0;0";NA;NA;NA;NA
data[2,]"0;0;0;0";"1129,941435;0;0;0";"1140,702782;0;0;0";"1140,702782;0;0;0";"2415,922401;0;0;0";"2469,729136;0;0;0";"2545,058565;0;0;0";NA;NA
data[3,]"0;0;0;0";"0;0;0;0";"0;0;0;0";"0;0;0;0";"1506,58858;0;0;0";"1506,58858;0;0;0";"1517,349927;0;0;0";"1528,111274;0;0;0";NA
Basically it's a 238-by-581 data frame. What I want is to keep the NAs as NAs, convert the "0;0;0;0" entries into NAs, and extract the first number from entries that have a non-zero value in the first position, like "1506,58858;0;0;0".
result should look like this:
data[1,] NA NA 1395,387994 1438,433382 1477,891654 NA NA NA NA
data[2,] NA 1129,941435 1140,702782 1140,702782 2415,922401 2469,729136 2545,058565 NA NA
data[3,] NA NA NA NA 1506,58858 1506,58858 1517,349927 1528,111274 NA
I read my data like this:
f0 <- read.table("D:../f0.per.call.csv", sep = ";",
                 na.strings = c("NA", "0;0;0;0"), stringsAsFactors = FALSE)
I know it is a very easy task, but I can't figure it out; I keep getting errors when I try to convert the characters to numerical values. Any help will be appreciated, thanks.
I would do it in 2 steps after reading the file:
replace "0;0;0;0" with NA
use a regular expression to strip the trailing ";0;0;0" from the remaining columns
Here is the code I used for both steps:
dat <- read.table("D:../f0.per.call.csv",
                  sep = ";", na.strings = c("NA"), stringsAsFactors = FALSE)
dat[dat == "0;0;0;0"] <- NA
sapply(dat, function(x) gsub("(.*);0;0;0", "\\1", x))
V1 V2 V3 V4 V5 V6 V7 V8 V9
[1,] NA NA "1395,387994" "1438,433382" "1477,891654" NA NA NA NA
[2,] NA "1129,941435" "1140,702782" "1140,702782" "2415,922401" "2469,729136" "2545,058565" NA NA
[3,] NA NA NA NA "1506,58858" "1506,58858" "1517,349927" "1528,111274" NA
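Note that sapply() here returns a character matrix rather than modifying dat. To keep the result as a data.frame, one option (a small addition to the answer) is to assign it back column by column:
dat[] <- lapply(dat, function(x) gsub("(.*);0;0;0", "\\1", x))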
After reading in your data, you can use strsplit and extract just the first item using lapply/sapply/vapply. Here's an example:
f0 <- read.table("D:../f0.per.call.csv", sep=";",
na.strings = c("NA","0;0;0;0"),
stringsAsFactors = FALSE)
f0[] <- lapply(f0, function(y)
vapply(strsplit(as.character(y), ";"),
function(z) z[[1]], ""))
f0
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 <NA> <NA> 1395,387994 1438,433382 1477,891654 <NA> <NA> <NA> <NA>
# 2 <NA> 1129,941435 1140,702782 1140,702782 2415,922401 2469,729136 2545,058565 <NA> <NA>
# 3 <NA> <NA> <NA> <NA> 1506,58858 1506,58858 1517,349927 1528,111274 <NA>
The result here is a data.frame, just like the input was a data.frame.
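The question also mentions errors when converting the characters to numerical values. These values use a comma as the decimal separator, so one extra step (an assumption about the desired final form, not shown in either answer) is to swap the comma for a dot before converting; NAs pass through unchanged:
f0[] <- lapply(f0, function(y) as.numeric(sub(",", ".", y, fixed = TRUE)))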

What is the difference between cor and cor.test in R

I have a data frame whose columns are different samples from an experiment. I want to find the correlation between these samples, i.e. the correlation between samples V2 and V3, between V2 and V4, and so on.
This is the data frame:
> head(t1)
V2 V3 V4 V5 V6
1 0.12725011 0.051021886 0.106049328 0.09378767 0.17799444
2 0.86096784 1.263327211 3.073650624 0.75607466 0.92244361
3 0.45791031 0.520207274 1.526476608 0.67499102 0.49817761
4 0.00000000 0.001139721 0.003158557 0.00000000 0.00000000
5 0.13383965 0.098943019 0.099922146 0.13871867 0.09750611
6 0.01016334 0.010187671 0.025410170 0.00000000 0.02369374
> nrow(t1)
[1] 23367
If I run the cor function on this data frame to get the correlation between the samples (columns), I get NA for every pair:
> cor(t1, method= "spearman")
V2 V3 V4 V5 V6
V2 1 NA NA NA NA
V3 NA 1 NA NA NA
V4 NA NA 1 NA NA
V5 NA NA NA 1 NA
V6 NA NA NA NA 1
but if I run this:
> cor.test(t1[,1],t1[,2], method="spearman")$estimate
rho
0.92394
the result is different. Why is this? What is the correct way to get the correlations between these samples?
Thank you in advance.
Your data contains NA values.
From ?cor:
If use is "everything", NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
From ?cor.test:
na.action: a function which indicates what should happen when the data contain NAs. Defaults to getOption("na.action").
On my system:
getOption("na.action")
[1] "na.omit"
Use which(is.na(t1)) to search for NA values, and which(!is.finite(as.matrix(t1))) to search for other problematic values (is.finite() needs a matrix or vector, not a data frame). cor returns NaN if you have Inf values in your data.
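To compute the correlations in spite of the NAs, one standard option documented in ?cor (not shown in the original answer) is the use argument:
# use complete pairs of observations for each pair of columns,
# roughly matching cor.test's default na.omit handling
cor(t1, method = "spearman", use = "pairwise.complete.obs")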
