R: Extracting data from a data frame for analysis

I am trying to extract data from a data frame for analysis.
heightweight <- function(person, health) {
## Read in data
data <- read.csv("heightweight.csv", header = TRUE,
colClasses = "character")
## Check that the outcomes are valid
measure = c("height", "weight")
if(health %in% measure == FALSE){
stop("Valid inputs are height and weight")
}
## Truncate the data matrix to only what columns are needed
data <- data[c(1, 5, 7)]
## Rename columns
names(data)[1] <- "Name"
names(data)[2] <- "Height"
names(data)[3] <- "Weight"
## Convert numeric columns to numeric
data[, 2] <- as.numeric(data[, 3])
data[, 3] <- as.numeric(data[, 4])
## Convert NAs to 0 after coercion
data[is.na(data)] <- 0
## Check that the name is valid
name <- data[, 1]
name <- unique(name)
if(person %in% name == FALSE){
stop("Invalid person")
}
## Return person with lowest height or weight
list <- data[data$name == person & data[health],]
outcomes <- list[, health]
minumum <- which.min(outcomes)
## Min Rate
minimum[rowNum, ]$name
}
The problem I am having is occurring with
list <- data[data$name == person & data[health],]
That is, when I run heightweight("Bob", "weight"), I get the following message:
Error in matrix(unlist(value, recursive = FALSE, use.names = FALSE), nrow = nr, :
length of 'dimnames' [2] not equal to array extent
I have Googled this message and checked out some threads here but can't determine what the problem is.

Unless I'm missing something, if you only need the lowest weight or height for a given name, the last three lines of code are a bit redundant.
Here's a simple way to get the minimum health measurement for a given person:
min(data[data$name==person, "height"])
The first part selects only the rows of data that correspond to that person; it acts as a row index. The second part, after the comma, selects only the desired variable (column). Once you have selected the desired data, you take the minimum of that subset.
An example to illustrate the result:
data<-data.frame(name=as.character(c(rep("carlos",2),rep("marta",3),rep("johny",2),"sara")))
set.seed(1)
data$height <- rnorm(8,68,3)
data$weight <- rnorm(8,160,10)
The corresponding data frame:
name height weight
1 carlos 66.12064 165.7578
2 carlos 68.55093 156.9461
3 marta 65.49311 175.1178
4 marta 72.78584 163.8984
5 marta 68.98852 153.7876
6 johny 65.53859 137.8530
7 johny 69.46229 171.2493
8 sara 70.21497 159.5507
Let's say we want the minimum weight for marta:
person <- "marta"
health <- "weight"
The minimum "weight" for "marta" is,
min(data[data$name==person,health])
which gives the desired result:
[1] 153.7876

Here is the simplified analogue of your function:
heightweight <- function(person,health) {
data.set <- data.frame(names=rep(letters[1:5],each=3),height=171:185,weight=seq(95,81,by=-1))
d1 <- data.set[data.set$names == person,]
d2 <- d1[d1[,health]==min(d1[,health]),]
d2[,c('names',health)]
}
The first line produces a sample data set. The second line selects all records for the given person. The third line keeps the record(s) with the minimum value of health, and the last line returns the names column together with the health column.
heightweight('b','height')
# names height
# 4 b 174
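The error in the question appears to come from two things: the columns were renamed to Name, Height and Weight, so data$name is NULL, and data[health] is a one-column data frame rather than a logical condition. Below is a minimal sketch of the function that keeps the validation steps and uses the subsetting shown in the answers. The function name min_measure, its arguments, and the toy data are hypothetical; the data frame is passed in rather than read from heightweight.csv so that the example is self-contained.

```r
# Hypothetical helper: min_measure() and its argument names are illustrative only.
min_measure <- function(data, person, health) {
  # Validate the requested measurement against the allowed column names
  if (!health %in% c("height", "weight")) {
    stop("Valid inputs are height and weight")
  }
  # Validate the person against the names present in the data
  if (!person %in% data$name) {
    stop("Invalid person")
  }
  # Rows are selected by name, the column by its name; then take the minimum
  min(data[data$name == person, health], na.rm = TRUE)
}

# Toy data with lower-case column names, matching how they are referenced above
people <- data.frame(
  name   = c("Bob", "Bob", "Ann"),
  height = c(70, 68, 65),
  weight = c(180, 175, 140),
  stringsAsFactors = FALSE
)

min_measure(people, "Bob", "weight")
# [1] 175
```

Keeping the column names in one case (all lower-case here) means the health argument can be used directly as a column name.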

Related

How to count missing values from two columns in R

I have a data frame which looks like this
**Contig_A** **Contig_B**
Contig_0 Contig_1
Contig_3 Contig_5
Contig_4 Contig_1
Contig_9 Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)

Create a vector of all the values that you want to check (all_contig) which is Contig_0 to Contig_10 here. Use setdiff to find the absent values and length to get the count of missing values.
cols <- c('Contig_A', 'Contig_B')
#If there are lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:10)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8" "Contig_10"
count_missing <- length(missing_contig)
#[1] 5

Alternatively, using match:
x <- 0:9
contigs <- paste0("Contig_", x)
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
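# as.character() guards against the columns being factors (the default before R 4.0)
xx <- c(as.character(df1$Contig_A), as.character(df1$Contig_B))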
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193
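As a small variation on the two answers above, the same count can be obtained with %in%, which returns a logical vector that can be both summed and used as an index. This is only a sketch on the four example rows from the question; the variable names present and is_missing are illustrative.

```r
# Toy data copied from the question
df1 <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0"),
  stringsAsFactors = FALSE
)

# Full set of expected ids; use 0:1193 for the real data
all_contig <- paste0("Contig_", 0:9)

# Ids seen in either column
present <- unique(c(df1$Contig_A, df1$Contig_B))

# TRUE where an expected id never appears in either column
is_missing <- !(all_contig %in% present)

all_contig[is_missing]
# [1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
sum(is_missing)
# [1] 4
```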

validating whether measurements are complete, double or have missing rows in R

I have a dataset which consists of measurements over multiple timepoints for multiple individuals. However, for some of the individuals I have one or two measurements and some even four (standard is 3). See the example data:
[image: data layout]
From this example table I would only need the measurements from ID 1 and 2. ID 3 has 3 measurements, but timepoint two is measured twice. ID 4 and 5 have missing timepoints.
Is there a way to check in a data frame whether the measurements are complete for each ID?
I have written some code for this:
get.completeTimepoints <- function(dataFrame){
unique.ids <- matrix(unique(dataFrame$id))
n.ids <- nrow(unique.ids)
for (i in 1:n.ids) {
temp.vector <- filter(dataFrame, id == unique.ids[i,]) #selects the id and all its measurements
validate.timepoints(temp.vector) #call validate function
}
}
I am currently stuck on how to check whether measurements for all three timepoints are present (see the function below). Any help would be appreciated.
validate.timepoints <- function(dataFrame){
row.df <- nrow(dataFrame)
if(row.df ==3){
#check if all 3 timepoints are present in the dataframe
}
}

I'm not exactly sure about the complexities of your data frame, but for the given data layout, I'm assuming you want a function that, given an id, returns TRUE if time points 1, 2 and 3 are all present.
If that's the case, the following should work:
t = data.frame(ID = c(rep(1,3), rep(2,3), rep(3,3), rep(4,2), rep(5,1)),
tp = c(seq(1,3), seq(1,3), 1,2,2,1,2,1),
v1 = rnorm(12,0,1), v2 = rnorm(12,0,1), v3 = rnorm(12,0,1))
ids = seq(1,5)
validate.timepoints = function(df, id){
# use the df argument rather than the global t
return(seq(1,3) %in% df$tp[which(df$ID == id)])
}
check = sapply(ids, function(x) all(validate.timepoints(t, x)))
Which returns:
> check
[1] TRUE TRUE FALSE FALSE FALSE
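The check above tests whether time points 1, 2 and 3 all occur, so an ID measured four times (for example 1, 2, 2, 3) would still pass. If such cases should be rejected as well, a stricter test can compare both the number of rows and the set of time points. The sketch below assumes the columns are called ID and tp as in the answer; ID 6 is a made-up extra case added to show the difference.

```r
# Toy data: IDs 1-5 follow the answer above, ID 6 is a hypothetical
# individual with four measurements (time point 2 recorded twice)
measurements <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6),
  tp = c(1, 2, 3, 1, 2, 3, 1, 2, 2, 1, 2, 1, 1, 2, 2, 3)
)
expected <- 1:3

# Complete means: exactly as many rows as expected time points,
# and the same set of time points
is_complete <- function(tp) {
  length(tp) == length(expected) && setequal(tp, expected)
}

# One TRUE/FALSE per ID
complete <- tapply(measurements$tp, measurements$ID, is_complete)
complete

# Keep only the rows that belong to complete IDs
complete_ids <- as.numeric(names(complete)[complete])
measurements[measurements$ID %in% complete_ids, ]
```

With this data only IDs 1 and 2 are kept.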

(Pearson's) Correlation loop through the data frame

I have a data frame with 159 observations and 27 variables, and I want to correlate all 159 observations from column 4 (variable 4) with each of the following columns, that is, correlate column 4 with 5, then column 4 with 6, and so on. I've been trying unsuccessfully to create a loop, and since I'm a beginner in R, it turned out harder than I thought. I want to simplify this because I need to do the same thing for a couple more data frames, and a function that could do it would make that much easier and less time-consuming. It would be wonderful if anyone could help me.
df <- ZEB1_23genes # CHANGE ZEB1_23genes for df (dataframe)
for (i in colnames(df)){ # Check the class of the variables
print(class(df[[i]]))
}
print(df)
# Correlate ZEB1 with each of the 23 genes accordingly to Pearson's method
cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])
So I can correlate individually but I cannot create a loop to go back to column 4 and correlate it to the next column (5, 6, ..., 27).
Thank you!

If I've understood your question correctly, the solution below should work well.
#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))
#Correlation function
#Takes a data.frame containing columns with values to be correlated as input
#The column against which other columns must be correlated can be specified (start_col; default is 4)
#The number of columns to be correlated against start_col can also be specified (end_col; default is all columns after start_col)
#Function returns a data.frame containing start_col, end_col, and correlation value as rows.
my_correlator <- function(mydf, start_col = 4, end_col = 0){
if(end_col == 0){
end_col <- ncol(mydf)
}
#out_corr_df <- data.frame(start_col = c(), end_col = c(), corr_val = c())
out_corr <- list()
for(i in (start_col+1):end_col){
out_corr[[i]] <- data.frame(start_col = start_col, end_col = i, corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
}
return(do.call("rbind", out_corr))
}
test_run <- my_correlator(df, 4)
head(test_run)
# start_col end_col corr_val
# 1 4 5 -0.027508521
# 2 4 6 0.100414199
# 3 4 7 0.036648608
# 4 4 8 -0.050845418
# 5 4 9 -0.003625019
# 6 4 10 -0.058172227
The function takes a data.frame as input and returns another data.frame containing the correlations between a given column of the original data.frame and all subsequent columns. I do not know the structure of your data, and this function will fail if it runs into unexpected conditions (for instance, a character column among the columns being correlated).
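If the p-values are needed as well as the estimates, the cor.test results can be collected with lapply and then unpacked. This is a sketch on simulated data with the same shape as in the question (159 rows, 27 columns); the names target, others, tests and res are illustrative, and every column is assumed to be numeric.

```r
set.seed(42)
# Simulated stand-in for the real data: 159 observations, 27 numeric columns
df <- as.data.frame(matrix(rnorm(159 * 27), nrow = 159, ncol = 27))

target <- 4                       # column to correlate against
others <- (target + 1):ncol(df)   # columns 5 to 27

# Run one Pearson test per column and keep the full result objects
tests <- lapply(others, function(j) {
  cor.test(df[[target]], df[[j]], method = "pearson")
})

# Unpack the estimate and p-value of each test into one data frame
res <- data.frame(
  column   = names(df)[others],
  estimate = vapply(tests, function(tt) unname(tt$estimate), numeric(1)),
  p_value  = vapply(tests, function(tt) tt$p.value, numeric(1))
)
head(res)
```

To reuse this for other data frames, the body can be wrapped in a function that takes the data frame and the target column as arguments.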

Iterate a data frame containing lists of column numbers, of different lengths, with a function in R

I have a data frame (df) of survey responses about human values with 57 columns/variables of numerical/scale responses. Each column belongs to one of ten categories, and they're not in contiguous groups.
I have a second dataframe (scoretable) that associates the categories with the column numbers for the variables; the lists of column numbers are all different lengths:
scoretable <- data.frame(
valuename =
c("Conformity","Tradition","Benevolence","Universalism","Self-Direction",
"Stimulation","Hedonism","Achievement","Power","Security"),
valuevars = I(list(c(11,20,40,47), # Conformity
c(18,32,36,44,51), # Tradition
c(33,45,49,52,54), # Benevolence
c(1,17,24,26,29,30,35,38), # Universalism
c(5,16,31,41,53), # Self-Direction
c(9,25,37), # Stimulation
c(4,50,57), # Hedonism
c(34,39,43,55), # Achievement
c(3,12,27,46), # Power
c(8,13,15,22,56))), # Security
stringsAsFactors=FALSE)
I would like to iterate through scoretable with a function, valuescore, that calculates the mean and sd of all responses in that group of columns in data frame df and write the result to a third table of results:
valuescore = function(df,scoretable,valueresults){
valuename = scoretable[,1]
set <- df[,scoretable[,2]]
setmeans <- colMeans(set,na.rm=TRUE)
valuemean <- mean(setmeans)
setvars <- apply(set, 2, var)
valuesd <-sqrt(mean(setvars))
rbind(valueresults,c(valuename, valuemean, valuesd))
}
a <- nrow(scoretable)
for(i in 1:a){
valuescore(df,scoretable[i,],valueresults)
}
I am very new to R and programming in general (this is my first question here), and I'm struggling to determine how to pass list variables to functions and/or as address ranges for data frames.

Let's create an example data.frame:
df <- as.data.frame(replicate(57, rnorm(10, 50, 20)))
Let's prepare the table result format:
valueresults <- data.frame(
name = scoretable$valuename,
mean = 0
)
Now, loop over the rows of scoretable, calculate the mean of each column in the group, and then take the mean of those means. It is a brute-force approach (the other answer, using Map, is more elegant), but it may be easier to understand for an R beginner.
for(v in 1:nrow(scoretable)){
# let's suppose v = 1 "Conformity"
columns_id <- scoretable$valuevars[[v]]
# isolate columns that correspond to 'Conformity'
temp_df <- df[, columns_id]
# mean of the values of these columns
temp_means <- colMeans(temp_df)
# named value_mean so that it does not mask the mean() function
value_mean <- mean(temp_means)
# save result in the prepared table
valueresults$mean[v] <- value_mean
}
> (valueresults)
name mean
1 Conformity 45.75407
2 Tradition 52.76935
3 Benevolence 50.81724
4 Universalism 51.04970
5 Self-Direction 55.43723
6 Stimulation 52.15962
7 Hedonism 53.17395
8 Achievement 47.77570
9 Power 52.61731
10 Security 54.07066

Here is a way using Map to apply a function to the list scoretable[, 2].
First I will create a test df.
set.seed(1234)
m <- 100
n <- 57
df <- matrix(sample(10, m*n, TRUE), nrow = m, ncol = n)
df <- as.data.frame(df)
And now the function valuescore.
valuescore <- function(DF, scores){
f <- function(inx) mean(as.matrix(DF[, inx]), na.rm = TRUE)
res <- Map(f, scores[, 2])
names(res) <- scores[[1]]
res
}
valuescore(df, scoretable)
#$Conformity
#[1] 5.5225
#
#$Tradition
#[1] 5.626
#
#$Benevolence
#[1] 5.548
#
#$Universalism
#[1] 5.36125
#
#$`Self-Direction`
#[1] 5.494
#
#$Stimulation
#[1] 5.643333
#
#$Hedonism
#[1] 5.546667
#
#$Achievement
#[1] 5.3175
#
#$Power
#[1] 5.41
#
#$Security
#[1] 5.54
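The question also asks for a standard deviation per category and for the results as a table. A possible sketch that returns both in one data frame is shown below. It uses small made-up stand-ins (responses, scores, GroupA, GroupB) rather than the real 57-column data, and it assumes the standard deviation is taken over all responses in the group, which is not the same as the pooled value sqrt(mean(setvars)) used in the question.

```r
set.seed(1)
# Toy stand-ins: 6 survey columns and 2 categories (all names are illustrative)
responses <- as.data.frame(matrix(sample(1:7, 20 * 6, replace = TRUE), ncol = 6))
scores <- data.frame(
  valuename = c("GroupA", "GroupB"),
  valuevars = I(list(c(1, 4, 5), c(2, 3, 6))),
  stringsAsFactors = FALSE
)

# Build one result row per category
rows <- lapply(seq_len(nrow(scores)), function(i) {
  # All responses belonging to the i-th category, as a numeric matrix
  vals <- as.matrix(responses[, scores$valuevars[[i]], drop = FALSE])
  data.frame(
    valuename = scores$valuename[i],
    mean      = mean(vals, na.rm = TRUE),
    sd        = sd(as.vector(vals), na.rm = TRUE)
  )
})

# Stack the rows into a single results table
valueresults <- do.call(rbind, rows)
valueresults
```

Indexing the list column with [[i]] is what turns one entry of valuevars into a plain vector of column numbers.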

Better way to improve the for loop for my case in R?

Problem statement:
The data set has about 600,000 rows. The two columns of interest are mstr_program_list and loc_cat. The loc_cat column contains both missing and non-missing cells; the other columns have no NAs. For each program in mstr_program_list, I need the total number of rows associated with that program, the percentage of those rows with a non-missing loc_cat, and, among the non-missing rows, the number of distinct categories.
Example: for the Unknown program, the total number of rows is 3 and the number of non-missing rows in loc_cat is 2, so the percentage is (2/3)*100 and the number of categories is 2 (Rests: Full and Rests: lim).
> head(data)
        L.Name mstr_program_list     loc_cat
1 Six J'SGroup           Unknown        <NA>
2  Bj's- Maine     Roasted Tomat Rests: Full
3  Bj's- Maine           Unknown Rests: Full
4   Brad's Q Q           Unknown  Rests: lim
Expected output:
mstr_prog  total_count  %good (non-missing rows)  Number of loc_cat
Unknown    3            66.7                      2
The code below takes a very long time; in fact, no results are shown. Can anyone help me improve it? As far as I can tell, the problem is in how values are added to the vectors.
From what I have read, values should be added to a vector with c() rather than append():
v <- c(v, 'y') # adding elements into a vector
Code:
data <- read.csv("MgData.csv",header=T, na.strings="", colClasses = classes, nrows = 600338,comment.char="") ## import data.
data_NoNull <- na.omit(data)
mpl_unique <- unique(data$mstr_program_list)
mas_Prog_List <- as.character()
loc_Count <- as.numeric()
per_Seg <- as.numeric()
num_Seg <- as.numeric()
for(i in 1:length(mpl_unique)) {
l_t <- length(data$mstr_program_list[data$mstr_program_list == i]) # loc_cat specific to prog
l_g <- length(data_NoNull$mstr_program_list[data_NoNull$mstr_program_list == i]) ## to know filled ones excluding empty
s <- subset(data_NoNull, mstr_program_list==i, select =c(loc_cat))
if((any(i == mas_Prog_List)) == FALSE) {
no_Seg <- nrow(unique(s))
mas_Prog_List <- c(mas_prog_list, i) # Adding values to vector
loc_Count <- c(loc_count, l_t)
perct_Seg <- ((l_g/l_t)*100)
per_Seg <- c(per_Seg, perct_seg)
Num_Seg <- c(Num_Seg, no_seg)
}
}
}
Seg_analysis <- data.frame(mas_Prog_List, loc_Count, per_Seg, num_Seg)
I am new to R. Please correct the code, naming conventions and terminology where needed.
Thanks
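One possible way to avoid both the loop and the growing vectors is to compute each summary per group with tapply. The sketch below is not tested on the real file; it uses a four-row toy data frame modelled on the rows shown above, and the output column names (mstr_prog, total_count, pct_good, n_loc_cat) are illustrative.

```r
# Toy data modelled on the rows shown in the question
data <- data.frame(
  mstr_program_list = c("Unknown", "Roasted Tomat", "Unknown", "Unknown"),
  loc_cat = c(NA, "Rests: Full", "Rests: Full", "Rests: lim"),
  stringsAsFactors = FALSE
)

prog <- data$mstr_program_list
has_cat <- !is.na(data$loc_cat)   # TRUE where loc_cat is filled in

# Per-program summaries; tapply returns the groups in the same order each time
total_count <- tapply(has_cat, prog, length)
pct_good    <- tapply(has_cat, prog, mean) * 100
n_cat       <- tapply(data$loc_cat, prog,
                      function(x) length(unique(x[!is.na(x)])))

seg_analysis <- data.frame(
  mstr_prog   = names(total_count),
  total_count = as.vector(total_count),
  pct_good    = round(as.vector(pct_good), 1),
  n_loc_cat   = as.vector(n_cat)
)
seg_analysis
```

For the Unknown program this gives a total of 3 rows, 66.7 percent non-missing, and 2 categories, matching the expected output in the question.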
