Create dataframe with missing data


I'm very new to R, so please excuse my potentially noob question.
I have hormone concentration data from 23 individuals, collected hourly. I've interpolated between the hourly collections to get concentrations between 2.0 and 15 pg/ml at intervals of 0.1, which gives 131 rows of data per individual.
Some individuals' concentrations, however, don't go beyond 6.0 pg/ml (for example), which means I have dataframes with unequal numbers of rows across individuals. I need all individuals to have 131 rows for the next step, where I combine all the data.
I've tried to create a dataframe of NAs with 131 rows and two columns, and then add the individual's interpolated data into the NA dataframe, so that the end result is a 131-row data frame with missing data as NA, but it's not going so well.
interp_saliva_002_x <- as.tibble(matrix(, nrow = 131, ncol = 1))
interp_sequence <- as.numeric(seq(2,15,.1))
interp_saliva_002_x[1] <- interp_sequence
colnames(interp_saliva_002_x)[1] <- "saliva_conc"
test <- left_join(interp_saliva_002_x, interp_saliva_002, by "saliva_conc")
Can you help me to understand where I'm going wrong or is there a more logical way to do this?
Thank you!

Let's assume you have 3 vectors with different lengths:
A <- seq(1,5); B <- seq(2,8); C <- seq(3,5)
Change the length of the vectors to the length that you want (in your case it's 131; I picked 7 for simplicity):
length(A) <- 7; length(B) <- 7; length(C) <- 7 # this pads each vector with NA up to the new length
Next you can cbind the vectors into a matrix:
m <- cbind(A,B,C)
# A B C
#[1,] 1 2 3
#[2,] 2 3 4
#[3,] 3 4 5
#[4,] 4 5 NA
#[5,] 5 6 NA
#[6,] NA 7 NA
#[7,] NA 8 NA
You can also change your matrix to a dataframe:
df<-as.data.frame(m)
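If the rows need to line up by concentration value rather than by position, an alternative is to join each individual's data onto the full 2.0 to 15.0 grid, which is what the left_join attempt in the question was aiming for (that call also needs by = "saliva_conc", with the equals sign). The following is only a base-R sketch: the data frame ind, its time_h column, and the helper pad_to_grid are made up for illustration and do not come from the question's data.

```r
# Full grid of target concentrations: 2.0, 2.1, ..., 15.0 (131 values).
# Rounding to one decimal keeps the join keys comparable despite floating-point noise.
grid <- data.frame(saliva_conc = round(seq(2, 15, by = 0.1), 1))

# Hypothetical individual whose data stop at 6.0 pg/ml (41 rows).
ind <- data.frame(saliva_conc = round(seq(2, 6, by = 0.1), 1))
ind$time_h <- seq_len(nrow(ind)) / 10   # made-up values

# Hypothetical helper: pad one individual's data to the full grid.
pad_to_grid <- function(df, grid) {
  # all.x = TRUE keeps every grid row; grid rows without a match get NA
  merge(grid, df, by = "saliva_conc", all.x = TRUE)
}

padded <- pad_to_grid(ind, grid)
nrow(padded)               # 131
sum(is.na(padded$time_h))  # 90 rows padded with NA
```

With one data frame per individual stored in a list, lapply(list_of_individuals, pad_to_grid, grid = grid) would pad them all before combining.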

Related

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and store the result in a new column. However, in cases where Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be done.
Please note that these columns are not rounded in the same way, so many numbers are the same but shown with a different nsmall.

It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
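As a rough illustration of the coalesce route, here is a sketch on a made-up three-row miniature of the data (the values are copied from the printout above, but the data frame itself is hypothetical). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical miniature of the data shown above
test <- data.frame(
  Sample_Intensity_RTC  = c(NA, 41293681.00, NA),
  Sample_Intensity_nRTC = c(NA, 41293681.00, 76693308.46)
)

# coalesce() returns, element by element, the first value that is not NA
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)
test
```

The result matches the ifelse version: rows where both columns are NA stay NA.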

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not to my current knowledge of R, which is growing, but still in its infancy). Is there another command which I should be looking into?
Thank you

This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
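Once the data are split, the same computation can be run on every site with lapply or sapply. A small sketch, using a made-up miniature of reefdata with only three of its columns:

```r
# Hypothetical miniature of reefdata (values taken from the question's sample rows)
reefdata <- data.frame(
  site  = c("alice", "alice", "steps", "steps", "andrea1"),
  vis_m = c(NA, 25, 25, NA, NA),
  rate  = c(9, 9, 9, 9, 5)
)

# One data frame per site, stored in a named list
splitdat <- split(reefdata, reefdata$site)

# Apply the same summary to every site at once
rows_per_site <- sapply(splitdat, nrow)
mean_rate     <- sapply(splitdat, function(d) mean(d$rate))
```

rows_per_site and mean_rate are named vectors, so rows_per_site[["alice"]] looks up a single site.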

I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA

split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names (as.character() also covers the case where reefdata$site is a factor)
sites <- as.character( unique( reefdata$site ) )
# create an empty list with one named slot per site
dives <- vector( "list", length( sites ) )
names( dives ) <- sites
# collect data per site into the list
for( i in seq_along( sites ) )
{
  # subset the rows belonging to the current site
  dive <- reefdata[ reefdata$site == sites[ i ] , ]
  # store the resulting data.frame in the list
  dives[[ i ]] <- dive
}

Replacing NAs with correlated values in rows

Hi all. I have a data frame with 5 samples (A, B, C, D, E). For each miRNA with a missing value, I want to first find the miRNA that is overall most highly correlated with it, and then fill in the missing value with a value derived from that miRNA. For example:
miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10
==> replace the missing value in miRNA-1 with 4, derived from the second miRNA.
This is what I want to do for my data frame in R.
Any help would be really appreciated :)
                                          A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p         NA 13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p   15.70536 52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p       NA 21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c     9.58696 44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p 18.06865 28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p         NA 10.77471  8.039662        NA       NA

Have you had a look at the answers to "Replacing NA's in R numeric vectors with values calculated from neighbours" (especially Akrun's shortcut using zoo)? I appreciate it's not quite what you want, but it might give some leads. That approach uses the mean of the neighbours in a row, so for 1 2 3 NA 5 it would suggest 4 (the average of 3 and 5).
Trying to find a correlation between pairs with just 4 data points, as one is missing, is a challenge.
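For what the question literally describes, one possible reading is: pick the row with the highest absolute correlation to the incomplete row, fit a straight line between the two on the samples they share, and predict the missing value from that line. The sketch below does this on the toy example from the question. The linear regression step, the helper name impute_from_best and the third row mirna3 are assumptions added for illustration, and with so few samples the chosen partner can easily be a coincidence.

```r
# Toy data: rows are miRNAs, columns are samples (mirna3 is made up)
m <- rbind(mirna1 = c(1, 2, 3, NA, 5),
           mirna2 = c(2, 4, 6, 8, 10),
           mirna3 = c(5, 1, 4, 2, 3))
colnames(m) <- c("A", "B", "C", "D", "E")

# Hypothetical helper: fill NAs in one row from its most correlated row
impute_from_best <- function(m, target) {
  y <- m[target, ]
  others <- setdiff(rownames(m), target)
  # correlation with every other row, using only samples present in both
  r <- sapply(others, function(o) cor(y, m[o, ], use = "pairwise.complete.obs"))
  best <- others[which.max(abs(r))]
  x <- m[best, ]
  fit <- lm(y ~ x)                      # incomplete pairs are dropped by default
  miss <- is.na(y) & !is.na(x)          # only fill where the partner has a value
  y[miss] <- predict(fit, newdata = data.frame(x = x[miss]))
  y
}

filled <- impute_from_best(m, "mirna1")
filled
```

Here mirna2 is the best partner (mirna1 is exactly half of it on the shared samples), so the missing D value is filled with 4, as in the question's example.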

R Pooled DataFrame analysis

I'm trying to perform several analyses on subsets of data in a dataframe in R, and I was wondering if there is a generic way of doing this.
Say, I have a dataframe like:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
How could I apply some computation (e.g. cumulative counting) to the values in column "one", conditional on (grouped by) the value in column "three"?
That is, I want to do stuff to one column, based upon grouping in another column. I can do this with loops, but I feel there might be standard ways to do this all at once.
thank you in advance!

ddply(data, .(coln), Stat) from the plyr package does the trick exactly.
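For simple per-group computations that return one value per row, base R's ave() is another option that needs no extra package. A minimal sketch on the data frame from the question; the new column names cumsum_one and count_in_group are made up:

```r
# Data frame from the question
data <- data.frame(one = 1:5, two = 6:10, three = c(11, 12, 11, 11, 15), four = 16:20)

# Running total of "one" within each value of "three"
data$cumsum_one <- ave(data$one, data$three, FUN = cumsum)

# Running count of rows within each value of "three"
data$count_in_group <- ave(data$one, data$three, FUN = seq_along)

data
```

Rows 1, 3 and 4 share three == 11, so their running totals are 1, 4 and 8 and their running counts are 1, 2 and 3.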

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual, i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid, which takes the same value across all rows (situations) belonging to one individual.
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for whom pid=1, and the next six rows, with pid=2, are different observations for the second person.

This should work for you:
z <- replicate(100,
d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but will deal with duplicated rows. replicate has its obvious use of performing a set operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call with rbind and lapply deals with the duplicates, which the code above does not handle well: %in% only tests membership, so a pid that is drawn twice still appears only once. Thus, instead of generating dataframes with potentially different numbers of rows, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4, do.call("rbind", lapply(sample(unique(d$pid),3,replace=TRUE),
                                          function(x) d[d$pid==x,])),
               simplify=FALSE)
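The same idea can be wrapped in a small function and tried on the two-person example from the question. This is only a sketch: the function name resample_panel and the extra boot_id column (which tells repeated draws of the same pid apart) are assumptions added here, not part of the answer above.

```r
# Example data from the question: two people, six rows each
d <- data.frame(
  pid = rep(1:2, each = 6),
  x   = seq(10, 32, by = 2),
  y   = 2:13,
  z   = seq(-5, 0.5, by = 0.5)
)

# Hypothetical helper: draw whole individuals with replacement
resample_panel <- function(d, n_ids = length(unique(d$pid))) {
  ids <- unique(d$pid)
  # index into ids instead of sample(ids, ...), which misbehaves when ids has length 1
  drawn <- ids[sample.int(length(ids), n_ids, replace = TRUE)]
  pieces <- lapply(seq_along(drawn), function(k) {
    block <- d[d$pid == drawn[k], ]
    block$boot_id <- k   # distinguishes repeated draws of the same pid
    block
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

set.seed(1)
z <- replicate(4, resample_panel(d), simplify = FALSE)
```

Each element of z keeps all six rows of every drawn individual, so every resample here has 12 rows even when the same pid is drawn twice.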
