Replacing NAs with correlated values in rows - r

Hey all, I have a data frame with 5 samples: A, B, C, D, E. For each miRNA with a missing value, I want to find the miRNA that is most highly correlated with it overall and fill the gap with a value derived from that miRNA. For example:
miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10
==> replace the missing value with 4, derived from the second miRNA.
This is what I want to do for my data frame in R:
Any help would be really appreciated :)
                                           A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p          NA 13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p    15.70536 52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p        NA 21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c      9.58696 44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p  18.06865 28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p          NA 10.77471  8.039662        NA       NA

Have you had a look at this answer (esp. Akrun's shortcut from zoo)? I appreciate it's not quite what you want, but it might give some leads. It takes the mean of the neighbours in a row, so for 1 2 3 NA 5 it would suggest 4 (the average of 3 and 5).
Replacing NA's in R numeric vectors with values calculated from neighbours
Estimating a correlation between pairs from just 4 data points, since one is missing, is a challenge.
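The idea in the question can be sketched roughly as follows. This is only one possible reading of "a value derived from that miRNA": for each missing cell, pick the row with the highest absolute Pearson correlation over the shared non-missing samples, fit a simple linear regression on those samples, and predict the missing value. The function name impute_from_best_row and the min_shared cutoff are made up for illustration, and with only 5 samples the correlations will be very noisy.

```r
# Hypothetical helper: impute each NA from the most correlated other row.
impute_from_best_row <- function(m, min_shared = 3) {
  m <- as.matrix(m)
  out <- m
  for (i in seq_len(nrow(m))) {
    miss <- which(is.na(m[i, ]))
    if (length(miss) == 0) next
    # Correlation of row i with every other row, using shared samples only
    cors <- sapply(seq_len(nrow(m)), function(j) {
      shared <- !is.na(m[i, ]) & !is.na(m[j, ])
      if (j == i || sum(shared) < min_shared) return(NA_real_)
      suppressWarnings(cor(m[i, shared], m[j, shared]))
    })
    for (k in miss) {
      # A donor row must have a usable correlation and a value in sample k
      ok <- !is.na(cors) & !is.na(m[, k])
      if (!any(ok)) next  # no donor available: leave the NA in place
      j <- which(ok)[which.max(abs(cors[ok]))]
      shared <- !is.na(m[i, ]) & !is.na(m[j, ])
      fit <- lm(y ~ x, data = data.frame(x = m[j, shared], y = m[i, shared]))
      out[i, k] <- predict(fit, newdata = data.frame(x = m[j, k]))
    }
  }
  out
}

# The toy example from the question
m <- rbind("miRNA-1" = c(1, 2, 3, NA, 5),
           "miRNA-2" = c(2, 4, 6, 8, 10))
colnames(m) <- LETTERS[1:5]
filled <- impute_from_best_row(m)
filled["miRNA-1", "D"]  # numerically equal to 4 for this toy input
```

On the real data frame, cells for which no other row has both enough shared samples and a value in the missing column are left as NA.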

Related

Displaying answers on ranking question in R

I have the following variables which are the result of one ranking question. On that question, participants got the 7 listed motivations presented and should rank them. Here, value 1 means the participant put the motivation on position 1, and value 7 means he put it on last position. The ranking is expressed through the numbers on these variables (numbers 1 to 7):
'data.frame': 25 obs. of 8 variables:
$ id : num 8 9 10 11 12 13 14 15 16 17 ...
$ motivation_quantity : num NA 3 1 NA 3 NA NA NA 1 NA ...
$ motivation_quality : num NA 1 6 NA 3 NA NA NA 3 NA ...
$ motivation_timesaving : num NA 6 4 NA 2 NA NA NA 5 NA ...
$ motivation_contribution : num NA 4 2 NA 1 NA NA NA 2 NA ...
$ motivation_alternativelms: num NA 5 3 NA 6 NA NA NA 7 NA ...
$ motivation_inspiration : num NA 2 7 NA 4 NA NA NA 4 NA ...
$ motivation_budget : num NA 7 5 NA 7 NA NA NA 6 NA ...
What I want to do now is to calculate and visualize the results of the ranking question (i.e. visualize how the motivations were ranked). Since I haven't worked with R for a long time, I am not sure how best to do this.
One way I could imagine is to first calculate the top 3 answers (the motivations that were most frequently ranked on position "1", "2", and "3" across participants).
I would really appreciate it if someone could help out with this or even show a better way to analyse and visualize my data.
I originally had a visualization in Microsoft Forms, but it got destroyed by a bug overnight. It looked like this:
These variables are stored by R as numeric (in statistics terms, that suggests continuous variables). The goal is then to convert them into categorical variables (called factors in R).
Let's get to work:
library(dplyr)
library(tidyr)
library(ggplot2) # needed for the ggplot() call further down
# we shall name your dataframe df. Note that integer columns also count as numeric for is.numeric(), so converting id to integer would not shield it from mutate_if(is.numeric, ...). Instead, let's select only the motivation columns and convert them into factors (categorical variables); across() needs dplyr 1.0 or later.
df <- df %>% mutate(across(starts_with("motivation_"),
~ factor(.x, levels = 1:7)))
# I guess in your case that would be all, but if you wanted the content of the dataframe to be position_1, position_2 ... position_7, we just add labels like this:
df <- df %>% mutate(across(starts_with("motivation_"),
~ factor(.x, levels = 1:7,
labels = paste("position", 1:7, sep = "_"))))
# For the visualisation now, we need to use the function gather in order to convert the df dataframe into a two column dataframe (and keeping the id column), we shall name this new dataframe df1
df1 <- df %>% gather(key=Questions, value=Answers, motivation_quantity:motivation_budget,-id )
# the df1 dataframe now includes three columns : the id column - the Questions columns - the Answers column.
# we can now apply the ggplot function on the new dataframe for the visualisation
# first the colours
colours <- c("firebrick4","firebrick3", "firebrick1", "gray70", "blue", "blue3" ,"darkblue")
# ATTENTION since there are NAs in your dataframe, either you can recode them as zeros or delete them (for the visualisation) using the subset function within the ggplot function as follows :
ggplot(subset(df1,!is.na(Answers)))+
aes(x=Questions,fill=Answers)+
geom_bar()+
coord_flip()+
scale_fill_manual(values = colours) +
ylab("position_levels")
# of course you can enter many modifications into the visualisation but in total I think that's what you need.
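For the "top 3" idea mentioned in the question, a minimal base R sketch could count, per motivation, how often it was ranked on position 1, 2 or 3. The small data frame below is mock data built from the first few rows visible in the str() output (shortened column names would work the same way); it assumes the columns are still numeric, i.e. this runs before the factor conversion.

```r
# Mock data: the first non-missing participants from the str() output, plus one NA row
ranks <- data.frame(
  id                        = c(8, 9, 10, 12, 16),
  motivation_quantity       = c(NA, 3, 1, 3, 1),
  motivation_quality        = c(NA, 1, 6, 3, 3),
  motivation_timesaving     = c(NA, 6, 4, 2, 5),
  motivation_contribution   = c(NA, 4, 2, 1, 2),
  motivation_alternativelms = c(NA, 5, 3, 6, 7),
  motivation_inspiration    = c(NA, 2, 7, 4, 4),
  motivation_budget         = c(NA, 7, 5, 7, 6)
)

motivation_cols <- setdiff(names(ranks), "id")

# Number of participants who put each motivation on position 1, 2 or 3
top3 <- colSums(ranks[motivation_cols] <= 3, na.rm = TRUE)

# Mean rank per motivation (lower means more important)
mean_rank <- colMeans(ranks[motivation_cols], na.rm = TRUE)

sort(top3, decreasing = TRUE)
sort(mean_rank)
```

Either vector can be passed to barplot() for a quick look before building the stacked ggplot chart.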

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and store the result in a new column. However, where Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be done.
Please note that these columns are not rounded in the same way, so many numbers are the same but shown with a different nsmall.
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
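A minimal sketch of the coalesce variant, using a three-row mock-up of the data shown above rather than the full file: dplyr::coalesce() returns, position by position, the first value that is not NA.

```r
library(dplyr)

# Small mock-up of the imported data (three of the rows shown above)
test <- data.frame(
  Sample_ID             = c("191017QMXP002", "191017QNXP008", "191017USXP001"),
  Sample_Intensity_RTC  = c(NA, 41293681.00, NA),
  Sample_Intensity_nRTC = c(NA, 41293681.00, 76693308.46)
)

# Take the RTC value where present, otherwise fall back to the nRTC value
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)
test
```

Rows where both columns are NA stay NA, which matches the ifelse result.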

Reformat wrapped data coerced into a dataframe? (R)

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match eid with the value written in scientific notation. so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. It needs to discard the last three values of each first row, all of the values in row 2, and keep the last/second value in the third row. Then place it next to the 1st value of row 1. Then repeat. The column names other than 'eid' are totally disposable, too. I've never had to deal with wrapped data before so don't know where to begin.
**edited to show df after read-in.
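The extraction described above can be sketched with plain row indexing, assuming the read-in data frame always has exactly three rows per record, with the eid in the first column of the first row and the wanted value in the second column of the third row. The name wrapped and the column name extra (standing in for the unnamed NA column) are placeholders.

```r
# Mock-up of the data frame after read-in (two records, three rows each)
wrapped <- data.frame(
  eid   = c(1, -152552, -5190.13, 2, -172188, -5480.54),
  nint  = c(1, -6.90613e+04, 4.17751e-05, 1, -8.16684e+04, 4.01573e-05),
  hisv  = c(0, -884198, NA, 0, -809131, NA),
  large = c(1, -48775.7, NA, 1, -56956.1, NA),
  extra = c(NA, 1151.70, NA, NA, -1364.07, NA)
)

# This only works if every record really occupies three rows
stopifnot(nrow(wrapped) %% 3 == 0)

first_rows <- seq(1, nrow(wrapped), by = 3)  # rows holding the eid
third_rows <- first_rows + 2                 # rows holding the value to keep

result <- data.frame(
  eid   = wrapped[first_rows, 1],
  sigma = wrapped[third_rows, 2]
)
result
```

Because this is vectorised indexing rather than a loop, it should scale to hundreds of thousands of eids without trouble.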

Create dataframe with missing data

I'm very new to R, so please excuse my potentially noob question.
I have hormone concentration data from 23 individuals, collected hourly. I've interpolated between the hourly collections to get concentrations between 2.0 and 15 pg/ml at intervals of 0.1, which gives 131 rows of data per individual.
Some individuals' concentrations, however, don't go beyond 6.0 pg/ml (for example), which means I have dataframes with unequal numbers of rows across individuals. I need all individuals to have 131 rows for the next step, where I combine all the data.
I've tried to create a dataframe of NAs with 131 rows and two columns, and then add the individual's interpolated data into the NA dataframe, so that the end result is a 131-row dataframe with missing data as NA, but it's not going so well.
interp_saliva_002_x <- as.tibble(matrix(, nrow = 131, ncol = 1))
interp_sequence <- as.numeric(seq(2,15,.1))
interp_saliva_002_x[1] <- interp_sequence
colnames(interp_saliva_002_x)[1] <- "saliva_conc"
test <- left_join(interp_saliva_002_x, interp_saliva_002, by = "saliva_conc")
Can you help me to understand where I'm going wrong or is there a more logical way to do this?
Thank you!
Let's assume you have 3 vectors with different lengths:
A<-seq(1,5); B<-seq(2,8); C<-seq(3,5)
Change the length of the vectors to the length that you want (in your case it's 131, I picked 7 for simplicity):
length(A)<-7; length(B)<-7; length(C)<-7 # this pads the missing positions at the end with NA
Next you can cbind the vectors to a matrix:
m <-cbind(A,B,C)
# A B C
#[1,] 1 2 3
#[2,] 2 3 4
#[3,] 3 4 5
#[4,] 4 5 NA
#[5,] 5 6 NA
#[6,] NA 7 NA
#[7,] NA 8 NA
You can also change your matrix to a dataframe:
df<-as.data.frame(m)
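If the rows need to stay aligned to the concentration grid (rather than just padded at the end), a join against the full grid is another option. The sketch below uses base merge with all.x = TRUE; the hour column and the values in interp_saliva_002 are invented for illustration. Rounding the key on both sides is a precaution, since joining on decimal numbers such as 2.3 can silently fail when the two sides were computed differently.

```r
# Full grid of 131 concentrations, rounded to one decimal to make matching reliable
grid <- data.frame(saliva_conc = round(seq(2, 15, by = 0.1), 1))

# Hypothetical individual whose interpolated values stop at 6.0 pg/ml
interp_saliva_002 <- data.frame(
  saliva_conc = round(seq(2, 6, by = 0.1), 1),
  hour        = seq(0, 10, length.out = 41)  # made-up interpolated times
)

# Left join: keep every grid row, fill NA where the individual has no data
padded <- merge(grid, interp_saliva_002, by = "saliva_conc", all.x = TRUE)

nrow(padded)             # one row per grid point
sum(is.na(padded$hour))  # rows beyond the individual's range
```

The same pattern would apply with dplyr's left_join(grid, interp_saliva_002, by = "saliva_conc").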

code issue with developing a sentiment analysis scoring model

I am trying to do some sentiment analysis on twitter data. I have a dictionary (afinn_list) which is something like below
good 5
bad -5
awesome 6
I have been able to generate a character variable which contains the location of each matched word. Now I want to generate a score variable which will contain the corresponding score for these matches. I am having hard time coming up with a for loop logic.
class(afinn_list)
[1] "data.frame"
vPosMatches <- match(words, afinn_list$word)
vPosMatches
[1] NA NA NA NA 1104 NA NA NA NA NA NA NA NA NA NA NA NA 1836 NA
I am sorry if the question is too naive. I am just trying to learn sentiment analysis using R.
Sentiment analysis is a complex task. Assuming you have cleaned up your Twitter data and stored it as one word per cell, I guess what you are lacking now is a way to score those cleaned-up words with your scoring "dictionary" afinn_list.
Assuming that your afinn_list looks like this
dictionary <- data.frame(grade=c('bad','not good', 'ok', 'good','very good'), score=1:5)
# grade score
1 bad 1
2 not good 2
3 ok 3
4 good 4
5 very good 5
and your mock_data (the cleaned-up data from Twitter) is
mock_data<-data.frame(data=rep(x=c('good','bad','rubbish','hello','very good'),10))
# data
1 good
2 bad
3 rubbish
4 hello
5 very good
6 good
You will merge the two data frames. In SQL terms, this is a left outer join. In R, it is implemented with the function merge, given the columns to join on and all.x=TRUE. Because the key column is called data in mock_data and grade in dictionary, name them with by.x and by.y.
Hence your code will look like this
merge(mock_data, dictionary, by.x='data', by.y='grade', all.x=TRUE)
I hope this answers your question.
Cheers
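Staying closer to the match() call in the question, the positions in vPosMatches can also be used directly as an index into the score column, so no for loop is needed. This sketch assumes afinn_list has a column named score next to word (only word appears in the question), and the words vector is made up.

```r
# Mock dictionary and tokenised tweet; the column name `score` is an assumption
afinn_list <- data.frame(
  word  = c("good", "bad", "awesome"),
  score = c(5, -5, 6),
  stringsAsFactors = FALSE
)
words <- c("the", "movie", "was", "awesome", "not", "bad")

# Position of each word in the dictionary (NA where there is no match)
vPosMatches <- match(words, afinn_list$word)

# Indexing with NA yields NA, so unmatched words simply get an NA score
vScores <- afinn_list$score[vPosMatches]

# Overall sentiment of the tweet, ignoring unmatched words
total_score <- sum(vScores, na.rm = TRUE)
```

Unlike merge, this keeps the scores in the original word order.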
