For loop is adding an extra value/element - r

I have a function which uses a for loop to go through a data frame for a specified number of columns, remove NA values, remove duplicate values, and then return the number of unique values present in those columns. The columns represent time, and the goal is to show how many unique values have appeared in total up to a given point in time. Here's the sample data frame:
X1 X2 X3 X4 X5 X6
1 F F F F F F
2 C C C C C C
3 D D D D D D
4 A# A# A# A A A
5 <NA> <NA> <NA> G G <NA>
And here's the function:
uniquepitches <- function(file, col){
y <- read.csv(file, na.strings=c(""))
frame <- data.frame(y)
x <- c()
for(i in 1:col) {
noNAframe <- frame[!is.na(frame[, 1:i])]
x[i] <- length(unique(noNAframe))
}
x
}
The issue is that when I run it for any value for col, I get the wrong values. For example, uniquepitches("testnotes.csv", 1) gives me 5, which should be 4. uniquepitches("testnotes.csv", 6) gives me [1] 5 5 5 6 6 6, which should be [1] 4 4 4 6 6 6. So as of right now it looks like the x vector has one element too many in the first three run-throughs, which is why the length is one too many. How can I fix it so that it's the correct length?

This task can be accomplished with sapply():
df <- data.frame(X1=c('F','C','D','A#',NA), X2=c('F','C','D','A#',NA), X3=c('F','C','D','A#',NA), X4=c('F','C','D','A','G'), X5=c('F','C','D','A','G'), X6=c('F','C','D','A',NA) );
sapply(df, function(c) length(unique(c[!is.na(c)])) );
## X1 X2 X3 X4 X5 X6
## 4 4 4 5 5 4
Edit: @Molx might be correct, although the OP needs to clarify to be sure. If the requirement is indeed to process the cumulative column content, rather than each individual column in isolation, then you can do this:
sapply(1:ncol(df), function(c) length(unique(df[,1:c][!is.na(df[,1:c])])) );
## [1] 4 4 4 6 6 6
Edit: Sorry, I should've been clearer. The sapply() call replaces the entire for loop. So the function can be rewritten as follows:
uniquepitches <- function(file,col) {
frame <- read.csv(file,na.strings=c(""));
sapply(1:col, function(c) length(unique(frame[,1:c][!is.na(frame[,1:c])])) );
}
(Also notice that read.csv() returns a data.frame, so there's no need for manual coercion.)
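For anyone who wants to try the cumulative count without a CSV file, here is a minimal sketch that builds the sample data in memory. The function name cumulative_unique is made up for this illustration, and it uses unlist() with drop = FALSE rather than the indexing shown above, so treat it as one possible variant rather than the answer's exact method.

```r
# Sample data from the question, built in memory instead of read from a file.
df <- data.frame(
  X1 = c("F", "C", "D", "A#", NA),
  X2 = c("F", "C", "D", "A#", NA),
  X3 = c("F", "C", "D", "A#", NA),
  X4 = c("F", "C", "D", "A", "G"),
  X5 = c("F", "C", "D", "A", "G"),
  X6 = c("F", "C", "D", "A", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique non-NA values in columns 1..i, for i = 1..col.
cumulative_unique <- function(frame, col) {
  vapply(seq_len(col), function(i) {
    # drop = FALSE keeps a data frame even when i == 1
    vals <- unlist(frame[, seq_len(i), drop = FALSE], use.names = FALSE)
    length(unique(vals[!is.na(vals)]))
  }, integer(1))
}

counts <- cumulative_unique(df, 6)
print(counts)
```

With the sample data this should reproduce the 4 4 4 6 6 6 the question expects.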

Related

R - Merging and aligning two CSVs using common values in multiple columns

I currently have two .csv files that look like this:
File 1:
Attempt         Result
Intervention 1  B
Intervention 2  H
and File 2:
Name      Outcome 1  Outcome 2  Outcome 3
Sample 1  A          B          C
Sample 2  D          E          F
Sample 3  G          H          I
I would like to merge and align the two .csv files so that each row of File 1 is aligned, by its "Result" cell, against whichever row of File 2 contains that value in any of its three "Outcome" columns, leaving blanks or NAs where there is no match.
Ideally, the result would look like this:
Attempt         Result  Name      Outcome 1  Outcome 2  Outcome 3
Intervention 1  B       Sample 1  A          B          C
                        Sample 2  D          E          F
Intervention 2  H       Sample 3  G          H          I
I've looked and only found answers when merging two .csv files with one common column. Any help would be very appreciated.
I will assume that each "Result" value in File 1 is unique, since several File 1 rows sharing the same result (e.g. "B") would force us to add extra columns to the final data frame.
With that assumption:
Attempt <- c("Intervention 1","Intervention 2")
Result <- c("B","H")
df1 <- as.data.frame(cbind(Attempt,Result))
one <- c("Sample 1","A","B","C")
two <- c("Sample 2","D","E","F")
three <- c("Sample 3","G","H","I")
df2 <- as.data.frame(rbind(one,two,three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name","Outcome 1","Outcome 2","Outcome 3")
vec_at <- rep(NA,nrow(df2)); vec_res <- rep(NA,nrow(df2)) # NA placeholders, one per row of df2
for (j in 1:nrow(df2)){
a <- which(df1$Result %in% unlist(df2[j,2:4])) # rows of df1 whose Result appears among this row's outcomes
if (length(a) >= 1){ # "a" is empty when nothing matches, so check before indexing
vec_at[j] <- as.character(df1$Attempt[a]) # store the matching Attempt
vec_res[j] <- as.character(df1$Result[a]) # and the matching Result
}
}
desired_df <- as.data.frame(cbind(vec_at,vec_res,df2)) # bind the new columns onto df2
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
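The same alignment can also be sketched without a loop, using base R's merge() twice: first stack the outcome columns into a long Name/Result table, then join. This is only an illustration under a few assumptions: the outcome columns are renamed Outcome_1 to Outcome_3 (syntactic names), and the intermediate names long, matched and aligned are invented here.

```r
df1 <- data.frame(Attempt = c("Intervention 1", "Intervention 2"),
                  Result  = c("B", "H"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name      = c("Sample 1", "Sample 2", "Sample 3"),
                  Outcome_1 = c("A", "D", "G"),
                  Outcome_2 = c("B", "E", "H"),
                  Outcome_3 = c("C", "F", "I"),
                  stringsAsFactors = FALSE)

outcome_cols <- c("Outcome_1", "Outcome_2", "Outcome_3")

# One row per (Name, outcome value) pair; unlist() stacks the columns
# one after another, so Name is repeated once per outcome column.
long <- data.frame(Name   = rep(df2$Name, times = length(outcome_cols)),
                   Result = unlist(df2[outcome_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Keep only the pairs whose value is a Result in df1 ...
matched <- merge(df1, long, by = "Result")

# ... then attach them to df2, keeping unmatched samples as NA.
aligned <- merge(df2, matched, by = "Name", all.x = TRUE)
print(aligned)
```

Unlike the loop, this version also copes with several File 1 rows matching the same sample: merge() then simply returns one output row per match.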
I wonder if you could use fuzzyjoin for something like this.
Here, you can provide a custom function for matching between the two data.frames.
library(fuzzyjoin)
fuzzy_left_join(
df2,
df1,
match_fun = NULL,
multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
multi_match_fun = function(x, y) {
y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
}
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H

Combine/match/merge vectors by row names

I have a vector of variable names and several single-column matrices.
I want to create a new matrix by matching/merging the row names of these single-column matrices.
Example:
A vector of variable names
Complete_names <- c("D","C","A","B")
Several single-column matrices
Matrix_1 <- matrix(c(1,2,3),3,1)
rownames(Matrix_1) <- c("D","C","B")
Matrix_2 <- matrix(c(4,5,6),3,1)
rownames(Matrix_2) <- c("A","B","C")
Desired output:
Desired_output <- matrix(c(1,2,NA,3,NA,6,4,5),4,2)
rownames(Desired_output) <- c("D","C","A","B")
[,1] [,2]
D 1 NA
C 2 6
A NA 4
B 3 5
I know there are several similar postings like this, but those previous answers do not work perfectly for this one.
The main job can be done with merge, returning a data frame:
merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
# Row.names V1.x V1.y
# 1 A NA 4
# 2 B 3 5
# 3 C 2 6
# 4 D 1 NA
Depending on your purposes you may then further modify names or get rid of Row.names.
The answers offered by Julius Vainora and achimneyswallow work well, but to obtain exactly the desired output I used:
temp <- merge(Matrix_1, Matrix_2, by = "row.names", all = TRUE)
temp$Row.names <- factor(temp$Row.names, levels=Complete_names)
temp <- temp[order(temp$Row.names),]
rownames(temp) <- temp[,1]
Desired_output <- as.matrix(temp[,-1])
V1.x V1.y
D 1 NA
C 2 6
A NA 4
B 3 5
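If a plain matrix is wanted from the start, an alternative is to look up each matrix's row names with match() instead of going through merge(). The sketch below is not from the answers above; the helper name align_rows and its argument target are made up for illustration.

```r
Complete_names <- c("D", "C", "A", "B")
Matrix_1 <- matrix(c(1, 2, 3), 3, 1, dimnames = list(c("D", "C", "B"), NULL))
Matrix_2 <- matrix(c(4, 5, 6), 3, 1, dimnames = list(c("A", "B", "C"), NULL))

# Hypothetical helper: pull the single column of m in the order given by
# target; names missing from m come back as NA.
align_rows <- function(m, target) {
  unname(m[match(target, rownames(m)), 1])
}

result <- sapply(list(Matrix_1, Matrix_2), align_rows, target = Complete_names)
rownames(result) <- Complete_names
print(result)
```

This keeps the row order of Complete_names directly, so no reordering step is needed afterwards.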

Vector to dataframe with variable row length [duplicate]

Given a vector, I want to convert it to a dataframe using a 'key' value which is randomly distributed throughout the vector at the start of what is to be a row. In this case, "z" would be the first value in each column.
vd <- c("z","a","b","c","z","a","b","c","z","a","b","c","d")
The resultant data should look like:
# using the magrittr pipe; t() transposes the result
data.frame(x1 = c("z","a","b","c", NA), x2 = c("z","a","b","c", NA), x3 = c("z","a","b","c","d")) %>%
t()
One solution would be to find the largest distance between 'keys' in the vector and then interject blank values at the end of 'sections' that are smaller than the longest 'section' so you could use matrix()
What would be the best way to do this?
plyr::ldply(split(vd, cumsum(vd == "z")), rbind)[-1]
(copied from here)
result:
1 2 3 4 5
1 z a b c <NA>
2 z a b c <NA>
3 z a b c d
We can use cumsum to identify the groups and then split on them. Then we pad the shorter vectors with NA up to the longest length and format the result as a data.frame.
x <- split(vd,cumsum("z"==vd))
maxl <- max(lengths(x))
as.data.frame(lapply(x,function(y) c(y,rep(NA,maxl-length(y)))))
# X1 X2 X3
# 1 z z z
# 2 a a a
# 3 b b b
# 4 c c c
# 5 <NA> <NA> d
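To get one row per key (the layout the question asks for) in base R only, the padding can also be done by assigning to length(), which fills with NA. This is a small sketch along the same lines as the answers above; the names groups, width and rows are chosen here for illustration.

```r
vd <- c("z", "a", "b", "c", "z", "a", "b", "c", "z", "a", "b", "c", "d")

# Each "z" starts a new group.
groups <- split(vd, cumsum(vd == "z"))
width  <- max(lengths(groups))

# Extending a vector with `length<-` pads it with NA.
padded <- sapply(groups, function(g) {
  length(g) <- width
  g
})

# sapply() returns one column per group, so transpose to get one row per group.
rows <- t(padded)
print(rows)
```

The result is a character matrix; wrap it in as.data.frame() if a data frame is needed.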

order() not doing its job

This is driving me nuts. I am trying to sort a data frame by the first row in ascending order using the order function. Below is a minimal example:
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels))
newdf <- df[,with(df,order(df[1,]))]
print(newdf)
I have also tried this with
newdf <- df[,order(df[1,])]
Here is the output I'm getting
X11 X2 X1 X10 X9 X8 X7 X6 X5 X4 X3
values 1 10 11 2 3 4 5 6 7 8 9
labels K B A J I H G F E D C
Which is clearly wrong! So what is going on here?
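This is an odd way to structure your data in R, so it will cause headaches, but you can make it work. See @thelatemail's comment re: columns vs rows. Because rbind() combines a numeric vector with a character vector, the numbers are coerced to character (or to factors once they are in the data frame), so order() sorts them as text: "1", "10", "11", "2", and so on. To make this work in your case, do: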
values <- c(11,10,9,8,7,6,5,4,3,2,1)
labels <- c("A","B","C","D","E","F","G","H","I","J","K")
df <- data.frame(rbind(values,labels), stringsAsFactors = FALSE)
newdf <- df[order(as.numeric(df["values",]))]
newdf
# X11 X10 X9 X8 X7 X6 X5 X4 X3 X2 X1
# values 1 2 3 4 5 6 7 8 9 10 11
# labels K J I H G F E D C B A
Note, in particular, stringsAsFactors = FALSE when you create the data frame.
Remember, data.frames are lists, and each element of the list is a vector (possibly a list, but typically an atomic vector, especially if constructed in a standard way) of the same length. The individual elements of the data frame are columns. Rows are just the nested elements with the same index value. This makes it much easier to work with a data frame like this:
df <- data.frame(values = values, labels = labels)
df[order(df$values),]
# values labels
# 11 1 K
# 10 2 J
# 9 3 I
# 8 4 H
# 7 5 G
# 6 6 F
# 5 7 E
# 4 8 D
# 3 9 C
# 2 10 B
# 1 11 A
Here you don't have to worry at all about whether your numbers are going to be coerced to characters and/or factors when you line them up with another vector that's character. In this example, whether or not labels was a factor had no impact on values.
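The difference between sorting numbers and sorting their character form can be seen in isolation with a short sketch. The variable names are made up here, and method = "radix" is used only so that the text ordering does not depend on the session's locale.

```r
values <- c(11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1)

# What the values look like once coerced to character.
as_text <- as.character(values)

# Text is compared character by character, so "10" sorts before "2".
text_sorted <- sort(as_text, method = "radix")

# Numbers are compared by magnitude.
number_sorted <- sort(values)

print(text_sorted)
print(number_sorted)
```

The text ordering matches the column order in the question's output (1, 10, 11, 2, ...), which is why converting back with as.numeric() before calling order() fixes it.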

Counting unique values across a row

I want to check that columns are consistent for each ID number (they're supposed to be constants, but there may be some doubt in the data, so I want to double check)
For example, given the following data frame:
test <- data.frame(ID = c("one","two","three"),
a = c(1,1,1),
b = c(1,1,1),
t = c(NA,1,1),
d = c(2,4,1))
I want to check that columns a, b, t and d are all the same, disregarding missing values. I thought I could do this by counting the unique values in the relevant columns, so that I can then select only the rows where the number of unique values is more than 1. I imagine this is likely not the best way of doing it, but it was the only way I could think of with my limited knowledge.
I found this question here, which seems to be similar to what I want to do:
Find unique values across a row of a data frame
But I am struggling to apply the answers to my data. I have tried this, which didn't do anything (but I've never used a for-loop before, so I've probably done that wrong), although when I run the inside of the function on its own for a single row it does exactly what I hope for:
yeartest <- function(x){
temp <- test[x,2:5]
temp <- as.numeric(temp)
veclength <- length(unique(temp[!is.na(temp)]))
temp2 <- c(temp,veclength)
test[,"thing"] <- NA
test[x,2:6] <- temp2
}
for(i in 1:nrow(test)){
yeartest(i)
}
Then I tried from the accepted answer, to apply that:
x <- test
# dups <- function(x) x[!duplicated(x)]
yeartest <- function(x){
# x <- 1
temp <- test[x,2:5]
temp <- as.numeric(temp)
veclength <- length(unique(temp[!is.na(temp)]))
temp2 <- c(temp,veclength)
test[,"thing"] <- NA
test[x,2:6] <- temp2
}
new.df <- t(apply(x, 1, function(x) yeartest(x)))
Which gives an error and so it is pretty obvious that I have made a mistake in my translation of the answer to my data.
Apologies, this must be a really obvious failing on my part, I am very grateful for any help.
Solution: (thank you for the help!)
test$new <- apply(test[,2:5],1,function(r) length(unique(na.omit(r))))
> df <- data.frame(
a=sample(2,10,replace=TRUE),
b=sample(2,10,replace=TRUE),
c=sample(c("a","b"),10,replace=TRUE),
d=sample(c("a","b"),10,replace=TRUE))
> df[c(3,6,8),1] <- NA
> df
a b c d
1 1 2 a b
2 1 2 a b
3 NA 2 a a
4 2 2 a b
5 1 2 a a
6 NA 1 a b
7 2 1 b b
8 NA 1 a a
9 1 1 b b
10 2 2 b b
> apply(df,1,function(r) length(unique(na.omit(r))))
[1] 3 3 2 4 3 2 4 2 3 3
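Applied to the data frame from the question, the same idea can be used to pick out the inconsistent IDs. This is a sketch only; n_unique and inconsistent are names chosen for the illustration, and the apply() call is restricted to the numeric columns 2:5 so that no values are converted to text.

```r
test <- data.frame(ID = c("one", "two", "three"),
                   a = c(1, 1, 1),
                   b = c(1, 1, 1),
                   t = c(NA, 1, 1),
                   d = c(2, 4, 1))

# Number of distinct non-missing values in each row of columns 2 to 5.
n_unique <- apply(test[, 2:5], 1, function(r) length(unique(na.omit(r))))

# Rows where the columns disagree.
inconsistent <- test[n_unique > 1, ]
print(inconsistent)
```

Here rows "one" and "two" are flagged because column d differs from the others, while "three" is consistent.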
