This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I am new to R and programming in general. I have two data frames, Wins and Losses, and I want to calculate the win probability from their counts. I want to go through the scores and check whether each value appears in both data frames; if it does, I want to perform an operation, and if it does not, I would like it to return NA.
df W            df L
score  freq     score  freq
5      10       5      10
10     10       10     5
7      2        3      2
4      1
Here is my function I have written so far:
test <- function(W, L){
  if (W$score == L$score) {
    total <- W$freq + L$freq
    W$freq / total
  }
  else NA
}
I want the output to be a list of the length of W:
0.5
0.66
NA
NA
This works fine for the first value in the data frame, but I get the following error: "the condition has length > 1 and only the first element will be used". I have been reading here on StackOverflow that I should use ifelse instead, as that is applied to all of the rows. However, when I tried this, it had a problem with the two data frame columns being of different lengths. I want to re-use this function over a lot of different data frames, and they will always be of different lengths, so I would like a solution for that.
Any help would be much appreciated and I can further clarify myself if it is currently unclear.
Thanks
You need to join these two data frames using the merge function, like this:
W <- data.frame(score=c(1,2,3), freq=c(5,10,15))
L <- data.frame(score=c(1,2,4), freq=c(2,4,8))
merge(W, L, by = "score", all = TRUE)
  score freq.x freq.y
1     1      5      2
2     2     10      4
3     3     15     NA
4     4     NA      8
Setting all = TRUE keeps every row from both data frames (a full outer join); where a score is missing from one of them, the corresponding freq column is NA.
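To get from the join to the win probabilities the question asks for, here is a minimal sketch. It swaps merge for match(), which looks up each W score in L and keeps W's row order (merge sorts by the key). The function name win_prob is hypothetical, and the data is the example from the question.

```r
W <- data.frame(score = c(5, 10, 7, 4), freq = c(10, 10, 2, 1))
L <- data.frame(score = c(5, 10, 3), freq = c(10, 5, 2))

# win_prob is a hypothetical helper name, not part of any package
win_prob <- function(W, L) {
  # For each score in W, the row position of the same score in L (NA if absent)
  idx <- match(W$score, L$score)
  # Indexing with NA yields NA, so unmatched scores propagate to NA
  W$freq / (W$freq + L$freq[idx])
}

win_prob(W, L)
```

Because the lookup is vectorized, it works for data frames of different lengths and always returns one value per row of W.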
This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have found questions that are similar to what I want to do but not the exact same.
I am working in R. I have two dataframes I want to combine. The only issue is that there are more observations in one dataframe than the other. (The data I have is proprietary so I'll make up some data to show you.) Let's say dataframe A has 450 observations and dataframe B has 500 observations.
Both dataframes have a variable that identifies a unique person. Let's say it's a social security number. Some people are in both dataframe A and dataframe B, but some exist in one and not the other. I want to keep the rows of people who are in both dataframes and drop the people who are in only one of them. To illustrate with fake data on a smaller scale...
Dataframe A
    SSID   Age  Wage
[1] 12345  23   45645
[2] 15461  45   534688
[3] 12458  12   475412
[4] 68741  63   124
[5] 36987  91   458746
Dataframe B
    SSID   Education  Race
[1] 12345  2          8
[2] 15461  3          4
[3] 89512  1          3
[4] 68741  2          7
[5] 99423  0          8
[6] 79225  1          4
[7] 66598  3          2
Dataframe C (what I want)
    SSID   Age  Wage    Education  Race
[1] 12345  23   45645   2          8
[2] 15461  45   534688  3          4
[3] 68741  63   124     2          7
So only the common rows, pertaining to the SSID variable, are preserved, and everything else is trashed. How can I do this?
I tried doing stuff like C = which(B$SSID %in% A$SSID) but to no avail.
I believe what you're looking for is an inner_join available in the dplyr package:
library(dplyr)
dataframe_c <- inner_join(dataframe_a, dataframe_b, by = "SSID")
Or you can use merge from base:
dataframe_c <- merge(dataframe_a, dataframe_b, by = "SSID", all = FALSE)
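As a self-contained check, the sketch below rebuilds the fake data from the question and runs the base merge call. The column types (plain numerics) are an assumption, since the question only shows printed values.

```r
# Fake data copied from the question; numeric column types are assumed
dataframe_a <- data.frame(
  SSID = c(12345, 15461, 12458, 68741, 36987),
  Age  = c(23, 45, 12, 63, 91),
  Wage = c(45645, 534688, 475412, 124, 458746)
)
dataframe_b <- data.frame(
  SSID      = c(12345, 15461, 89512, 68741, 99423, 79225, 66598),
  Education = c(2, 3, 1, 2, 0, 1, 3),
  Race      = c(8, 4, 3, 7, 8, 4, 2)
)

# Inner join: only SSIDs present in both data frames survive
dataframe_c <- merge(dataframe_a, dataframe_b, by = "SSID")
dataframe_c
```

Note that merge sorts the result by the key column by default, so the row order may differ from the order in dataframe_a.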
This question already has answers here:
Index values from a matrix using row, col indices
(4 answers)
Closed 5 years ago.
I'm trying to get the value of a matrix with a vector of row indexes and column indexes like this.
M = matrix(rnorm(100),nrow=10,ncol=10)
set.seed(123)
row_index = sample(10) # 3 8 4 7 6 1 10 9 2 5
column_index = sample(10) # 10 5 6 9 1 7 8 4 3 2
Is there any way I can do something like
M[row_index, column_index]
and get the values for
M[3,10], M[8,5], ...
as a vector?
We need cbind to create a two-column matrix in which the first column holds the row indices and the second the column indices:
M[cbind(row_index, column_index)]
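A small deterministic sketch of the difference between the two forms of indexing; the matrix and index values here are made up for illustration.

```r
# Illustrative data (not from the question): a 3 x 4 matrix filled column-wise
M <- matrix(1:12, nrow = 3, ncol = 4)
row_index <- c(3, 1, 2)
column_index <- c(4, 2, 1)

# Two separate index vectors select a whole 3 x 3 submatrix
sub <- M[row_index, column_index]

# A two-column index matrix selects one element per (row, column) pair
values <- M[cbind(row_index, column_index)]
values
```

Here values holds M[3, 4], M[1, 2] and M[2, 1], in that order.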
The solution I present is not the best way of doing things in R, because in most cases for loops are slow compared to vectorized operations. However, for this problem you can simply use a loop to index the matrix. There is rarely a good reason to avoid building an intermediate index object (such as a data frame or a matrix), but a loop lets us do without one:
for (i in seq_along(row_index)) {
  print(M[row_index[i], column_index[i]])
}
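The loop above only prints the values. If they are needed as a vector, as the question asks, one possible variant stores them in a preallocated vector instead. This is a sketch on made-up data.

```r
# Illustrative data (not from the question)
M <- matrix(1:12, nrow = 3, ncol = 4)
row_index <- c(3, 1, 2)
column_index <- c(4, 2, 1)

# Preallocate the result, then fill one element per index pair
values <- numeric(length(row_index))
for (i in seq_along(row_index)) {
  values[i] <- M[row_index[i], column_index[i]]
}
values
```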
This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I haven't been able to find a solution to this so far... This one came the closest: 1
Here is a small subset of my dataframe, df:
ANIMAL(chr)  MARKER(int)  GENOTYPE(int)
"1012828"    1550978      0
"1012828"    1550982      2
"1012828"    1550985      1
"1012830"    1550982      0
"1012830"    1550985      2
"1012830"    1550989      2
And what I want is this...
ANIMAL     MARKER_1550978  MARKER_1550982  MARKER_1550985  MARKER_1550989
"1012828"  0               2               1               NA
"1012830"  NA              0               2               2
My initial thought was to create a column for each marker, following the referenced question:
markers <- unique(df$MARKER)
df[,markers] <- NA
Since I can't have integers for column names in R, I added "MARKER_" to each new column name so it would work:
df$MARKER <- paste0("MARKER_", df$MARKER)
markers <- unique(df$MARKER)
df[,markers] <- NA
Now I have all my new columns, but with the same number of rows. I'll have no problem getting rid of unnecessary rows and columns, but how would I correctly populate my new columns with their correct GENOTYPE by MARKER and ANIMAL? I am guessing one or more of these: indexing, match, %in%... but I don't know where to start. Searching for these on StackOverflow did not yield anything that seemed pertinent to my challenge.
What you're asking is a very common dataframe operation, commonly called "spreading", or "widening". The inverse of this operation is "gathering". Check out this handy cheatsheet, specifically the part on reshaping data.
library(tidyr)
df %>% spread(MARKER, GENOTYPE)
#>    ANIMAL 1550978 1550982 1550985 1550989
#> 1 1012828       0       2       1      NA
#> 2 1012830      NA       0       2       2
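In newer versions of tidyr, spread is superseded by pivot_wider. A minimal sketch, assuming tidyr 1.0.0 or later is installed; the names_prefix argument produces the MARKER_ column names the question asks for without modifying df first.

```r
library(tidyr)  # assumes tidyr >= 1.0.0, where pivot_wider() is available

# Data copied from the question
df <- data.frame(
  ANIMAL   = c("1012828", "1012828", "1012828", "1012830", "1012830", "1012830"),
  MARKER   = c(1550978L, 1550982L, 1550985L, 1550982L, 1550985L, 1550989L),
  GENOTYPE = c(0L, 2L, 1L, 0L, 2L, 2L)
)

# One row per ANIMAL, one column per MARKER; missing combinations become NA
wide <- pivot_wider(
  df,
  names_from   = MARKER,
  values_from  = GENOTYPE,
  names_prefix = "MARKER_"
)
wide
```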
This question already has answers here:
The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe
(11 answers)
Closed 7 years ago.
I have the following data frame:
df <- data.frame(a=rep(1:3),b=rep(1:3),c=rep(4:6),d=rep(4:6))
df
  a b c d
1 1 1 4 4
2 2 2 5 5
3 3 3 6 6
I would like to have a value N which determines my window size, so for this example I will set
N <- 1
I would like to split this dataframe into equal portions of N rows and store the 3 resulting dataframes into a list.
I have the following code:
groupMaker <- function(x, y) 0:(x-1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))
The problem is that this code renames my column names by adding an X0. in front
result <- as.data.frame(testlist2[1])
result
  X0.a X0.b X0.c X0.d
1    1    1    4    4
I would like code that does the exact same thing but keeps the column names as they are. Please keep in mind that my original data has a lot more than 3 rows, so I need something that is applicable to a much larger dataframe.
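To extract a list element, we can use [[. Single brackets, as in testlist2[1], return a list of length one whose element is named 0, and as.data.frame prefixes that name to the columns (hence X0.a). Also, as each list element is already a data.frame, we don't need to call as.data.frame again.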
testlist2[[1]]
We can also use gl to create the grouping variable.
split(df, as.numeric(gl(nrow(df), N, nrow(df))))
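A short sketch using the question's own df, N and groupMaker to show what each kind of bracket returns:

```r
df <- data.frame(a = 1:3, b = 1:3, c = 4:6, d = 4:6)
N <- 1
groupMaker <- function(x, y) 0:(x - 1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))

# Single bracket: still a list (of length one), not a data frame
one_bracket <- testlist2[1]

# Double bracket: the data frame itself, with its original column names
two_bracket <- testlist2[[1]]
names(two_bracket)
```

To process every chunk, lapply(testlist2, some_function) works on the list directly, where some_function stands in for whatever needs to be applied.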
This question already has answers here:
Create a Data Frame of Unequal Lengths
(6 answers)
Closed 9 years ago.
I have the following elementary issue in R.
I have a for (k in 1:x){...} loop which produces numerical vectors whose length depends on k.
For each value of k I produce a single numerical vector.
I would like to collect them as rows of a data frame in R, if possible. In other words, I would like to introduce a data frame data s.t.
for (k in 1:x) {
  data[k,] <- ...
}
where the dots represent the command producing the vector with length depending on k.
Unfortunately, as far as I know, the length of the rows of a dataframe in R is constant, as it is a list of vectors of equal length. I have already tried to complete each row with a suitable number of zeroes to arrive at a constant length (in this case equal to x). I would like to work "dynamically", instead.
I do not think that this issue is equivalent to merging vectors of different lengths into a dataframe; because of the for loop, only 1 vector is known at each step.
Edit
A very easy example of what I mean. For each k, I would like to write the vector whose components are 1,2,...,k and store it as kth row of the dataframe data. In the above setting, I would write
for (k in 1:x) {
  data[k,] <- seq(1,k,1)
}
As the length of seq(1,k,1) depends on k the code does not work.
You could consider using ldply from plyr here.
library(plyr)
set.seed(123)
# k is the length of each result
k <- sample(5, 3, replace = TRUE)
#[1] 2 4 3
# Make a list of vectors, each a sequence from 1:k
ll <- lapply( k , function(x) seq_len(x) )
#[[1]]
#[1] 1 2
#[[2]]
#[1] 1 2 3 4
#[[3]]
#[1] 1 2 3
# take our list and rbind it into a data.frame, filling in missing values with NA
ldply( ll , rbind)
#  1 2  3  4
#1 1 2 NA NA
#2 1 2  3  4
#3 1 2  3 NA
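If adding plyr is not an option, the same result can be sketched in base R: collect the vectors in a list, pad each one with NA up to the longest length, and bind them as rows. The value x <- 4 is an arbitrary choice for illustration, and the vectors are the seq(1, k, 1) example from the question's edit.

```r
x <- 4  # arbitrary number of iterations, for illustration only

# Collect one vector per k in a list; lists can hold vectors of any length
rows <- lapply(seq_len(x), function(k) seq(1, k, 1))

# Pad every vector with NA up to the longest length
max_len <- max(lengths(rows))
padded <- lapply(rows, function(v) {
  length(v) <- max_len  # extending a vector fills the new slots with NA
  v
})

# Bind the padded vectors as rows and convert to a data frame
data <- as.data.frame(do.call(rbind, padded))
data
```

Building the list first and binding once at the end also avoids growing a data frame row by row inside the loop.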