Comparing two columns - r

I am new to R and I am having trouble with an operation that I did all the time in Python.
I have two data frames (database and creditIDs), and what I want to do is compare one column in database with one column in creditIDs. More specifically, if a value in database[,5] does not exist in creditIDs[,1], I want to delete that entire row in database.
Here is the code:
for (i in 1:lengthColumns){
  if (!(database$credit_id[i] %in% creditosVencidos)){
    database[i,]<-database[-i,]
  }
}
But I keep on getting this error:
50: In `[<-.data.frame`(`*tmp*`, i, , value = structure(list( ... :
replacement element 50 has 9696 rows to replace 1 rows
Could someone explain why this is happening? Thanks!

The which() function returns the indices at which a logical condition is TRUE, much like numpy.where() in Python. Using $ after a data frame with a column name gives you that column as a vector; alternatively you could use d[, column_number].
In this example I create x and y columns that share their first five values, and use which() to keep only the rows where the two are equal:
L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = rep(1:5, 2), y = 1:10, fac = fac))
d = d[which(d$x == d$y),]
d
x y fac
1 1 1 A
2 2 2 B
3 3 3 C
4 4 4 B
5 5 5 B
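One detail worth knowing when choosing between which() and a bare logical index: the two behave differently when the comparison produces NA. The following is a minimal sketch on a made-up data frame (the values are assumptions for illustration, not the asker's data):

```r
# Hypothetical data with a missing value in x
d <- data.frame(x = c(1, NA, 3), y = c(1, 2, 4))

# The comparison yields TRUE, NA, FALSE
cond <- d$x == d$y

# A logical index keeps the TRUE row and adds an all-NA row for the NA
by_logical <- d[cond, ]

# which() returns only the positions that are TRUE, so the NA is dropped
by_which <- d[which(cond), ]

nrow(by_logical)
nrow(by_which)
```

If the columns can contain missing values, which() (or an explicit !is.na() check) avoids the spurious NA rows.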

You will need to adjust this for your column names/numbers.
# Create two example data.frames
creditID <- data.frame(ID = c("896-19", "895-8", "899-1", "899-5"))
database <- data.frame(ID = c("896-19", "camel", "899-1", "goat", "899-1"))
# Method 1
database[database$ID %in% creditID$ID, ]
# Method 2 (subset() function)
database <- subset(database, ID %in% creditID$ID)
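As for why the original loop fails: database[-i, ] is the whole data frame with row i removed, so the assignment tries to store thousands of rows in the single row database[i, ]. A minimal sketch of this, and of the vectorised filter that replaces the loop, is below. The column names credit_id and ID and the toy values are assumptions chosen to mirror the question.

```r
# Hypothetical stand-ins for the asker's tables
database <- data.frame(credit_id = c("896-19", "camel", "899-1", "goat"),
                       amount = c(10, 20, 30, 40),
                       stringsAsFactors = FALSE)
creditIDs <- data.frame(ID = c("896-19", "895-8", "899-1"),
                        stringsAsFactors = FALSE)

# database[-2, ] is "everything except row 2": three rows here,
# which cannot be assigned into the one row database[2, ]
rows_without_second <- nrow(database[-2, ])

# Vectorised alternative: build one logical vector, then subset once
keep <- database$credit_id %in% creditIDs$ID
filtered <- database[keep, , drop = FALSE]
filtered
```

Subsetting once also avoids the problem of row positions shifting while a loop deletes rows.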

Related

Assign value in dataframe from list by list's element name = dataframe row number

I have a named list, such as the following:
> myNamedList
(...)
$`1870`
[1] 84.24639
$`1871`
[1] 84.59707
(...)
I would like to assign these values in a dataframe's column where the list element's name corresponds to the dataframe's row number. For now I am proceeding like this:
for (element in names(myNamedList)) {
  targetDataFrame[as.numeric(element),][[columnName]] = myNamedList[[element]]
}
This is quite slow if the list is somewhat large, and also not very R-esque. I believe I could do something with apply, but am not sure where to look. Appreciate your help.
Add a row number to original data, then stack the list, then merge. See example:
# example
#data
set.seed(1); d <- data.frame(x = sample(LETTERS, 5))
#named list
x <- list("2" = 11, "4" = 22)
#add a row number
d$rowID = seq(nrow(d))
# stack the list, and merge
merge(d, stack(x), by.x = "rowID", by.y = "ind", all.x = TRUE)
# rowID x values
# 1 1 Y NA
# 2 2 D 11
# 3 3 G NA
# 4 4 A 22
# 5 5 B NA
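If the list names really are row numbers, a merge is not strictly needed: the names can be converted to integers and used as an index directly. This is a minimal sketch of that alternative, assuming every list element is a single number; the data frame contents are made up for illustration.

```r
# Hypothetical data frame and named list
d <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
x <- list("2" = 11, "4" = 22)

# List names are the target row numbers
idx <- as.integer(names(x))

# Start with NA everywhere, then fill the matching rows in one step
d$values <- NA_real_
d$values[idx] <- unlist(x, use.names = FALSE)
d
```

This replaces the loop with a single vectorised assignment and leaves the row order of d untouched.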

Create dataframe from smallest vector available

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
sensor2 = sensores[[2]]$Diff,
sensor3 = sensores[[3]]$Diff,
sensor4 = sensores[[4]]$Diff,
sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number of rows available from each one of the columns, in this case 19?
Maybe there is a built-in function in R that can do this, any solution is appreciated but I'd love to get something as general and as clear as possible.
Thank you in advance.
I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows,], b = df2[1:min_rows,], c = df3[1:min_rows,] )
diffs
a b c
1 1 1 1
2 2 2 2
3 3 3 3
Alternatively, use merge:
rowmerge <- function(x,y){
  # create row indicators for the merge:
  x$ind <- 1:nrow(x)
  y$ind <- 1:nrow(y)
  out <- merge(x,y, all = TRUE, by = "ind")
  out["ind"] <- NULL
  return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 NA 4 4
5 NA NA 5
To get rid of the rows with NAs, remove the all = TRUE argument, so that merge keeps only the row indices present in every data frame.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.
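The first approach can be generalised so that it does not hard-code the number of sensors. The sketch below assumes, as in the question, that sensores is a list of data frames that each have a Diff column; the toy values and the sensor1, sensor2, ... names are made up for illustration.

```r
# Hypothetical list of data frames with different numbers of rows
sensores <- list(
  data.frame(Diff = c(0.1, 0.2, 0.3)),
  data.frame(Diff = c(1, 2, 3, 4)),
  data.frame(Diff = c(5, 6, 7, 8, 9))
)

# Number of rows in the shortest data frame
n <- min(vapply(sensores, nrow, integer(1)))

# Take the first n values of Diff from every element
cols <- lapply(sensores, function(s) s$Diff[seq_len(n)])
names(cols) <- paste0("sensor", seq_along(cols))

diffs <- as.data.frame(cols)
diffs
```

seq_len(n) is used instead of 1:n so that the code still behaves sensibly if one of the data frames is empty.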

colSums in R: 'x' must be an array of at least two dimensions

I am a beginner in coding in general. I am trying to calculate two parameters from a data frame named a in R. For row i and column j, I am interested in finding:
B = (sum of all values in column j) - a[i,j]
C = (sum of all values in row i) - a[i,j]
For i=1 , j=2, I'm writing:
A = a[1,2]
B = (colSums(a[1:nrow(a),1],na.rm = FALSE, dims = 1) - A)
C = (rowSums(a[1,1:ncol(a)],na.rm = FALSE, dims = 1) - A)
C seems to give correct answer. However, B gives an error:
Error in base::colSums(x, na.rm = na.rm, dims = dims, ...) :
'x' must be an array of at least two dimensions
I have read other threads as well but couldn't find my answer. Do you have any suggestions?
The problem is due to the command a[1:nrow(a),1]. This command selects all rows of the first column of data frame a but returns the result as a vector (not a data frame). The function colSums does not work with one-dimensional objects (like vectors).
As a side note: You don't need 1:nrow(a) to select all rows. The same is easier to achieve with an empty argument before the comma: a[ , 1].
An example data frame:
dat <- data.frame(a = 1:3, b = 4:6)
# a b
# 1 1 4
# 2 2 5
# 3 3 6
If you select one column, the result is converted into a vector automatically.
dat[ , 1]
# [1] 1 2 3
If you specify drop = FALSE, a one-column data frame is returned.
dat[ , 1, drop = FALSE]
# a
# 1 1
# 2 2
# 3 3
This one-column data frame is a two-dimensional object and can therefore be used with colSums.
colSums(dat[ , 1, drop = FALSE])
# a
# 6
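Putting this together for the two quantities in the question, here is a minimal sketch using the example data frame dat from above. The indices i and j are chosen arbitrarily; drop = FALSE is used in both calls so that the input always stays two-dimensional.

```r
dat <- data.frame(a = 1:3, b = 4:6)
i <- 1
j <- 2

A <- dat[i, j]

# Sum of column j minus the cell itself
B <- colSums(dat[, j, drop = FALSE])[[1]] - A

# Sum of row i minus the cell itself
C <- rowSums(dat[i, , drop = FALSE])[[1]] - A

B
C
```

Selecting a single row of a data frame with several columns already returns a data frame, which is why the rowSums call in the question did not fail.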

Merge on multiple columns results in strange ordering

When two data frames are merged by a numerical column then (by default) they are ordered by that column as a number. However, if two numerical columns are used as the by then it results in a different ordering (in fact it seems as if the numerical columns are converted to strings and sorted as such). Is this expected, or a bug?
For example, consider the following two data frames:
A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))
Then merge(A, B, by = 'a') results in a data frame with a column a with values 1, 2, ..., 9, 10, 11, 12 (i.e., the expected numerical ordering). However merge(A, B, by = c('a', 'b')) results in a data frame with a column a with values 1, 10, 11, 12, 2, 3, ..., 8, 9 (i.e., the same ordering as sort(as.character(1:12))).
I would say this is a feature of merge rather than a bug.
Inspecting the source code of merge shows that when multiple columns are used for merging, the key columns are internally combined into a single character vector with paste().
For example, columns a and b from your data frame A will be represented by the character vector "1\r1" "2\r1" "3\r1" "4\r1" "5\r1" "6\r1" "7\r1" "8\r1" "9\r1" "10\r1" "11\r1" "12\r1".
merge sorts the resulting data frame by these strings, which is how it ends up with the alphabetical ordering.
When you merge by only one column, there is no need for paste, and therefore sorting uses the original type of the column.
Here is the relevant piece of the source code of merge (the full text can be obtained by running merge.data.frame without parentheses in the R console):
if (l.b == 1L) {
    bx <- x[, by.x]
    if (is.factor(bx))
        bx <- as.character(bx)
    by <- y[, by.y]
    if (is.factor(by))
        by <- as.character(by)
}
else {
    if (!is.null(incomparables))
        stop("'incomparables' is supported only for merging on a single column")
    bx <- x[, by.x, drop = FALSE]
    by <- y[, by.y, drop = FALSE]
    names(bx) <- names(by) <- paste0("V", seq_len(ncol(bx)))
    bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
    bx <- bz[seq_len(nx)]
    by <- bz[nx + seq_len(ny)]
}
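If the numeric ordering matters, one straightforward workaround in base R is to re-sort the merged result explicitly on the key columns. A minimal sketch using the data frames from the question follows; the seed is an arbitrary choice added only to make the random columns reproducible.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
A <- data.frame(a = 1:12, b = 1, x = runif(12))
B <- data.frame(a = 1:12, b = 1, y = runif(12))

# Merge on both keys; the row order may follow the pasted string keys
m <- merge(A, B, by = c("a", "b"))

# Re-sort numerically on the key columns and reset the row names
m <- m[order(m$a, m$b), ]
rownames(m) <- NULL
m$a
```

order() compares the columns using their own types, so the result does not depend on how merge built its internal keys.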
Using the dplyr package, full_join() gives the following result, with the rows in the expected numeric order:
library(dplyr)
full_join(A, B, by=c("a", "b"))
a b x y
1 1 1 0.39907404 0.700782559
2 2 1 0.84429488 0.600727090
3 3 1 0.32232471 0.141495156
4 4 1 0.74214210 0.262601640
5 5 1 0.92944116 0.779255689
6 6 1 0.10902661 0.001185645
7 7 1 0.46336478 0.961711785
8 8 1 0.58396008 0.211824751
9 9 1 0.63126074 0.422233784
10 10 1 0.09995935 0.179069642
11 11 1 0.40832159 0.581116173
12 12 1 0.48440814 0.004372634

How to change values in a column of a data frame based on conditions in another column?

I would like to have an equivalent of the Excel function "if". It seems basic enough, but I could not find relevant help.
I would like to assign "NA" to specific cells if two consecutive cells in a different column are not identical. In Excel, the formula would be the following (say in C1): if(A1 = A2, B1, "NA"). I would then just need to extend it to the rest of the column.
But in R, I am stuck!
Here is an equivalent of my R code so far.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
df
To get the following Type of each Type in another column, I found a useful function on StackOverflow that does the job.
# determines the following Type of each Type
shift <- function(x, n){
  c(x[-(seq(n))], rep(6, n))
}
df$TypeFoll <- shift(df$Type, 1)
df
Now, I would like to keep TypeFoll in a specific row when the File for this row is identical to the File on the next row.
Here is what I tried. It failed!
for(i in 1:length(df$File)){
  df$TypeFoll2 <- ifelse(df$File[i] == df$File[i+1], df$TypeFoll, "NA")
}
df
In the end, my data frame should look like:
aim = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"),
TypeFoll = c("2","3","4","4","5","6"),
TypeFoll2 = c("2","NA","4","4","NA","6"))
aim
Oh, and by the way, if someone would know how to easily put the columns TypeFoll and TypeFoll2 just after the column Type, it would be great!
Thanks in advance
I would do it as follows (not keeping the result from the shift function)
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"), stringsAsFactors = FALSE)
# This is your shift function
len=nrow(df)
A1 <- df$File[1:(len-1)]
A2 <- df$File[2:len]
# Why do you save the result of the shift function in the df?
Then apply the equivalent of if(A1 = A2, B1, "NA"). As akrun mentioned, ifelse is vectorised. This is also how you append a column to a data.frame:
# df$Type[2:len] is the following Type of each row, as in the desired TypeFoll2
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), 6) # Why 6?
Since 6 is hardcoded here, something like:
df$TypeFoll2 <- c(ifelse(A1 == A2, df$Type[2:len], NA), max(as.numeric(df$Type)) + 1)
is more generic.
First off, explicit for loops are often slower and less idiomatic in R than vectorised operations, so try to think of this as vector manipulation instead.
df = data.frame(Type = c("1","2","3","4","4","5"),
File = c("A","A","B","B","B","C"))
Create shifted types and files and put them in new columns:
df$TypeFoll = c(as.character(df$Type[2:nrow(df)]), "NA")
df$FileFoll = c(as.character(df$File[2:nrow(df)]), "NA")
Now, df looks like this:
> df
Type File TypeFoll FileFoll
1 1 A 2 A
2 2 A 3 B
3 3 B 4 B
4 4 B 4 B
5 4 B 5 C
6 5 C NA NA
Then, create TypeFoll2 by combining these:
df$TypeFoll2 = ifelse(df$File == df$FileFoll, df$TypeFoll, "NA")
And you should have something that looks a lot like what you want:
> df
Type File TypeFoll FileFoll TypeFoll2
1 1 A 2 A 2
2 2 A 3 B NA
3 3 B 4 B 4
4 4 B 4 B 4
5 4 B 5 C NA
6 5 C NA NA NA
If you want to remove the FileFoll column:
df$FileFoll = NULL
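For completeness, here is a minimal self-contained sketch that reproduces the layout of the aim data frame from the question, including the column order the asker wanted. It makes two assumptions taken from that aim: the last row is padded with "6", and the last row is treated as kept. It uses a real NA rather than the string "NA".

```r
df <- data.frame(Type = c("1", "2", "3", "4", "4", "5"),
                 File = c("A", "A", "B", "B", "B", "C"),
                 stringsAsFactors = FALSE)
n <- nrow(df)

# Following Type of each row; "6" is the padding used in the question
next_type <- c(df$Type[-1], "6")

# TRUE where the next row has the same File; the last row is kept by assumption
same_file <- c(df$File[-n] == df$File[-1], TRUE)

df$TypeFoll <- next_type
df$TypeFoll2 <- ifelse(same_file, next_type, NA)

# Put the new columns right after Type by indexing with the desired order
df <- df[, c("Type", "TypeFoll", "TypeFoll2", "File")]
df
```

Reordering columns is just a matter of indexing the data frame with the column names in the order you want.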

Resources