Populate multiple columns by values in one column [duplicate] - r

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I haven't been able to find a solution to this so far... This one came the closest: 1
Here is a small subset of my dataframe, df:
ANIMAL(chr) MARKER(int) GENOTYPE(int)
"1012828" 1550978 0
"1012828" 1550982 2
"1012828" 1550985 1
"1012830" 1550982 0
"1012830" 1550985 2
"1012830" 1550989 2
And what I want is this...
ANIMAL MARKER_1550978 MARKER_1550982 MARKER_1550985 MARKER_1550989
"1012828" 0 2 1 NA
"1012830" NA 0 2 2
My thought, initially was to create columns for each marker according to the referenced question
markers <- unique(df$MARKER)
df[,markers] <- NA
since I can't have integers for column names in R. I added "MARKER_" to each new column so it would work:
df$MARKER <- paste0("MARKER_", df$MARKER)
markers <- unique(df$MARKER)
df[,markers] <- NA
Now I have all my new columns, but with the same number of rows. I'll have no problem getting rid of unnecessary rows and columns, but how would I correctly populate my new columns with their correct GENOTYPE by MARKER and ANIMAL? Am guessing one-or-more of these: indexing, match, %in%... but don't know where to start. Searching for these in stackoverflow did not yield anything that seemed pertinent to my challenge.

What you're asking for is a very common data frame operation, usually called "spreading" or "widening" (reshaping from long to wide format). The inverse operation is "gathering". Check out this handy cheatsheet, specifically the part on reshaping data.
library(tidyr)
df %>% spread(MARKER, GENOTYPE)
#> ANIMAL 1550978 1550982 1550985 1550989
#> 1 1012828 0 2 1 NA
#> 2 1012830 NA 0 2 2
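In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), whose names_prefix argument can add the MARKER_ prefix from the question directly. A minimal sketch, assuming tidyr >= 1.0.0 is installed and rebuilding the sample data from the question by hand:

```r
library(tidyr)

# Sample data from the question, in long format (one row per ANIMAL/MARKER pair).
df <- data.frame(
  ANIMAL   = c("1012828", "1012828", "1012828", "1012830", "1012830", "1012830"),
  MARKER   = c(1550978L, 1550982L, 1550985L, 1550982L, 1550985L, 1550989L),
  GENOTYPE = c(0L, 2L, 1L, 0L, 2L, 2L),
  stringsAsFactors = FALSE
)

# One column per MARKER value; ANIMAL/MARKER pairs that never occur become NA.
wide <- pivot_wider(
  df,
  names_from   = MARKER,
  values_from  = GENOTYPE,
  names_prefix = "MARKER_"
)
```

Because the prefix is added while reshaping, there is no need to paste it onto df$MARKER beforehand.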


Get value of a matrix with row-index and column-index [duplicate]

This question already has answers here:
Index values from a matrix using row, col indices
(4 answers)
Closed 5 years ago.
I'm trying to get the value of a matrix with a vector of row indexes and column indexes like this.
M = matrix(rnorm(100),nrow=10,ncol=10)
set.seed(123)
row_index = sample(10) # 3 8 4 7 6 1 10 9 2 5
column_index = sample(10) # 10 5 6 9 1 7 8 4 3 2
Is there any way I can do something like
M[row_index, column_index]
and get the values for
M[3,10], M[8,5], ...
as a vector?
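We can use cbind to build a two-column matrix in which the first column holds the row indices and the second column holds the column indices. Indexing M with that matrix returns one value per (row, column) pair: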
M[cbind(row_index, column_index)]
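To see the difference between the two forms of indexing, here is a small sketch on a deterministic matrix (the values and index vectors are made up for illustration, chosen so the result can be checked by hand):

```r
# M[i, j] equals (j - 1) * 10 + i, because matrix() fills column by column.
M <- matrix(1:100, nrow = 10, ncol = 10)
row_index    <- c(3, 8, 4)
column_index <- c(10, 5, 6)

# Each row of the two-column index matrix is one (row, column) pair,
# so this returns M[3, 10], M[8, 5], M[4, 6] as a vector.
picked <- M[cbind(row_index, column_index)]

# For comparison, this returns the full 3 x 3 submatrix of all combinations.
block <- M[row_index, column_index]
```

The wanted values sit on the diagonal of block, which is why M[row_index, column_index] on its own does not give the vector the question asks for.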
The solution I present is not the best way of doing things in R, because in most cases for loops are slow compared to vectorized operations. For this problem, however, you can simply loop over the index pairs and look up each element in turn, without building an intermediate index object such as a matrix or data frame.
# seq_along() is safe even when row_index is empty, unlike 1:length(row_index)
for (i in seq_along(row_index)) {
  print(M[row_index[i], column_index[i]])
}

R if function over two columns of different length [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I am new to R and programming in general. I have two data frames, Wins and Losses, and I want to calculate the win probability from their counts. I want to go through the scores and check whether each score appears in both data frames; if it does, I want to perform an operation, and if it does not, I would like it to just return NA.
df W           df L
score freq     score freq
5     10       5     10
10    10       10    5
7     2        3     2
4     1
Here is my function I have written so far:
test <- function(W, L){
if (W$score == L$score) {
total <- W$freq + L$freq
W$freq / total
}
else NA
}
I want the output to be a list of the length of W:
0.5
0.66
NA
NA
This works fine for the first value in the data frame, but I get the following error: "the condition has length > 1 and only the first element will be used". I have been reading here on StackOverflow that I should use ifelse instead, as that works through all of the rows. However, when I tried this, it had a problem with the two data frame columns being of different lengths. I want to re-use this function on many different data frames, and they will always be of different lengths, so I would like a solution that handles that.
Any help would be much appreciated and I can further clarify myself if it is currently unclear.
Thanks
You need to join these two data frames using the merge function, like this:
W <- data.frame(score=c(1,2,3), freq=c(5,10,15))
L <- data.frame(score=c(1,2,4), freq=c(2,4,8))
merge(W, L, by="score", all=TRUE)
  score freq.x freq.y
1     1      5      2
2     2     10      4
3     3     15     NA
4     4     NA      8
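Setting all = TRUE keeps every score from both data frames (a full outer join); a score that is missing from one side gets NA in that side's freq column. Use all.x = TRUE instead if you only want the scores that appear in W.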
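If the result should be a vector in the same order and of the same length as W, match() is an alternative to merge(). The sketch below is not from the answer above: win_prob is a hypothetical helper name, the sample data assumes W has four rows and L has three (as the expected output in the question suggests), and it assumes each score appears at most once in L.

```r
W <- data.frame(score = c(5, 10, 7, 4), freq = c(10, 10, 2, 1))
L <- data.frame(score = c(5, 10, 3),    freq = c(10, 5, 2))

# win_prob is a hypothetical helper, not part of any package.
win_prob <- function(W, L) {
  idx <- match(W$score, L$score)   # position in L of each W score, NA if absent
  W$freq / (W$freq + L$freq[idx])  # NA propagates for scores that have no match
}

p <- win_prob(W, L)
```

Because the lookup is vectorized, no if or ifelse is needed, and W and L may have different numbers of rows.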

How to split data.frame into smaller data.frames of predetermined number of rows? [duplicate]

This question already has answers here:
The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe
(11 answers)
Closed 7 years ago.
I have the following data frame:
df <- data.frame(a=rep(1:3),b=rep(1:3),c=rep(4:6),d=rep(4:6))
df
a b c d
1 1 1 4 4
2 2 2 5 5
3 3 3 6 6
I would like to have a value N that determines my window size, so for this example I will set
N <- 1
I would like to split this dataframe into equal portions of N rows and store the 3 resulting dataframes into a list.
I have the following code:
groupMaker <- function(x, y) 0:(x-1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))
The problem is that this code renames my columns by adding an X0. prefix:
result <- as.data.frame(testlist2[1])
result
X0.a X0.b X0.c X0.d
1 1 1 4 4
I would like code that does exactly the same thing but keeps the column names as they are. Please keep in mind that my original data has many more than 3 rows, so I need something that is applicable to a much larger data frame.
To extract a list element, we can use [[. Also, since each list element is already a data.frame, we don't need to call as.data.frame again.
testlist2[[1]]
We can also use gl to create the grouping variable.
split(df, as.numeric(gl(nrow(df), N, nrow(df))))
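A short sketch of where the X0. prefix comes from, using a made-up five-row data frame and N = 2 (both are assumptions for illustration): single brackets return a one-element list named "0", and as.data.frame() folds that list name into the column names.

```r
df <- data.frame(a = 1:5, b = 6:10)
N  <- 2

# Zero-based row number integer-divided by N: rows 1-2, 3-4 and 5 form the groups.
groups <- (seq_len(nrow(df)) - 1) %/% N
chunks <- split(df, groups)

first_single <- chunks[1]    # still a list (length 1, named "0")
first_double <- chunks[[1]]  # the data frame itself, column names unchanged

prefixed <- as.data.frame(first_single)  # list name "0" becomes the "X0." prefix
```

The same grouping works for any number of rows; the last chunk is simply shorter when nrow(df) is not a multiple of N.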

Excluding values in cross table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
R filtering out a subset
I have an R dataset. In this dataset, I wish to create a cross table of two categorical variables using the gmodels package, and then run a chisq.test on them.
The two variables are witness and agegroup. witness takes the values 1, 2 and 9. agegroup takes the values 1 and 2.
I wish to exclude rows from the table where witness=9 and/or a third variable, EMS, equals 2, but I am not sure how to proceed.
library(gmodels)
CrossTable (mydata$witness, mydata$agegroup)
chisq.test (mydata$witness, mydata$agegroup)
...so my question is, how can I do the above with the conditions that witness!=9 and EMS!=2?
data:
witness agegroup EMS
1 1 2
2 2 2
1 1 2
2 1 2
9 2 2
2 2 2
1 2 2
9 2 2
2 1 2
#save the data in your current working directory
data <- read.table("data", header=TRUE, sep = " ")
data$witness[data$witness == "9"] <- NA
mydata <- data[!is.na(data$witness),]
library("gmodels")
CrossTable(mydata$witness, mydata$agegroup, chisq=TRUE)
You can leave the variable "EMS" in "mydata". It does no harm to your analysis!
HTH
I expect this question to be closed as it really seems like a duplicate. But as both Chase and I suggested, I think some form of subsetting is the simplest way to go about this, e.g.
mydata[mydata$witness !=9 & mydata$EMS !=2,]
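Putting the subsetting and the test together might look like the sketch below. The data here is invented for illustration (the sample in the question has EMS equal to 2 in every row, so filtering it would leave nothing), and base table() stands in for gmodels::CrossTable so the example has no package dependency.

```r
# Hypothetical data: witness in {1, 2, 9}, agegroup in {1, 2}, EMS in {1, 2}.
mydata <- data.frame(
  witness  = c(1, 2, 1, 2, 9, 2, 1, 9, 2, 1),
  agegroup = c(1, 2, 1, 1, 2, 2, 2, 2, 1, 2),
  EMS      = c(1, 1, 2, 1, 1, 1, 1, 2, 1, 1)
)

# Keep only the rows where both conditions hold.
kept <- mydata[mydata$witness != 9 & mydata$EMS != 2, ]

# Cross table and chi-squared test on the filtered rows only.
tab <- table(kept$witness, kept$agegroup)
# With counts this small chisq.test() warns that the approximation may be poor.
res <- suppressWarnings(chisq.test(tab))
```

If witness or EMS can contain NA, subset(mydata, witness != 9 & EMS != 2) is the safer form, since it drops rows where the condition evaluates to NA instead of returning rows filled with NA.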

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, one per survey Participant, and 158 columns, each of which represents a question number. The answers for each are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any question that a Participant did not answer without excluding the entire Participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
Which works fine when I am working with sets where all my answers are contained in one column. Then this would just delete the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data such that my large data set would end up having 'blanks' where the "99" occurred, so that these 99s would not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and T-Tests on various questions.
Resp Q001 Q002 Q003 Q004
1    2    4         2
2    3         1    3
3    4    4    2    5
4         1    3    2
5    1    3    4    2
Is this possible to do in R? I've tried to filter it before submitting to R, but it won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better code or package to use)
Any assistance would be greatly appreciated!
You could replace the "99" by NA and then calculate the colMeans omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))
colMeans(df) # wrong: the 99s are counted as answers and inflate the means
dfc <- df
dfc[dfc == 99] <- NA
colMeans(dfc, na.rm = TRUE)
You can also tell R which values are NAs when you read in your data file. For your particular case (header = TRUE because the file starts with the column names):
mydata <- read.table('dat_base', header = TRUE, na.strings = "99")
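As a sketch of the whole workflow on the five-row sample from the question, the text argument of read.table stands in for the real file (the object names raw, dat and question_means are made up for illustration):

```r
raw <- "Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2"

# Every 99 is read as NA, so no participant has to be dropped.
dat <- read.table(text = raw, header = TRUE, na.strings = "99")

# Per-question means, skipping the Part column and ignoring the NAs.
question_means <- colMeans(dat[-1], na.rm = TRUE)
```

Each participant keeps all answered questions; only the individual unanswered cells are left out of the means.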
