Replacing value of one df column only in specific rows - r

I have a vector, index, that holds the row numbers of a df I want to modify for one specific column:
index <- c(1,3,6)
dm <-
one two three four
x   y   z     k
a   b   c     r
s   e   t     f
e   d   f     a
a   e   d     r
q   t   j     i
Now I want to modify column "three" only for rows 1, 3 and 6 replacing whatever value in it with "A".
Should I use apply?

There is no need for apply. You could simply use the following:
dm$three[index] <- "A"
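As a minimal, self-contained sketch of that assignment, the data frame below is rebuilt from the table in the question (the stringsAsFactors = FALSE argument is an assumption, so that "three" is a character column rather than a factor):

```r
# Rebuild the example data frame from the question.
dm <- data.frame(
  one   = c("x", "a", "s", "e", "a", "q"),
  two   = c("y", "b", "e", "d", "e", "t"),
  three = c("z", "c", "t", "f", "d", "j"),
  four  = c("k", "r", "f", "a", "r", "i"),
  stringsAsFactors = FALSE
)
index <- c(1, 3, 6)

# Overwrite column "three" in the chosen rows only.
dm$three[index] <- "A"

# Equivalent form using matrix-style indexing:
# dm[index, "three"] <- "A"

dm$three
```

If "three" were a factor, assigning a value that is not one of its levels would produce NA with a warning, so convert the column with as.character() first in that case.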

Related

How to find the longest sequence of non-NA rows in R?

I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter which determines whether a row has NA or not, and then will determine the count for the number of consecutive rows without an NA. I would then put this in an if statement to keep restarting every time an NA is encountered, outputting a dataframe with the lengths of every sequence of non-NAs, which I could use to find the longest such sequence. This seems very inefficient so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and want to isolate all columns from the single row that ends the longest run of consecutive non-NA values in WANT.
Example data
df <- data.frame(A = LETTERS[1:10],
B = LETTERS[1:10],
C = LETTERS[1:10],
WANT = LETTERS[1:10],
E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since most non-NA in WANT
#10 J J J <NA> J
Here you would want the row containing the I values, since it sits at the end of the longest run of non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of using this is that it counts consecutive non-NA runs across all columns (of any type, e.g. character or numeric); you can then index on whichever column you want using which.max()
# from @jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
r <- rle(x)
s <- sapply(r$lengths, seq_len)
s[r$values] <- lapply(s[r$values], `*`, 0)
unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
If this isn't correct, please edit your question for clarity.
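If the goal is instead to extract every row of the longest non-NA run, rather than only its last row, a sketch along the following lines may help. It applies rle() directly to the column of interest; the small data frame and the names run, run_start, run_end and longest_run are made up for illustration, and the code assumes the column contains at least one non-NA value.

```r
# Small example: WANT has NA in rows 3 and 10.
df <- data.frame(A = LETTERS[1:10], WANT = LETTERS[1:10],
                 stringsAsFactors = FALSE)
df$WANT[c(3, 10)] <- NA

# Run-length encode the "is not NA" indicator.
run <- rle(!is.na(df$WANT))

# First and last row of every run.
run_end   <- cumsum(run$lengths)
run_start <- run_end - run$lengths + 1

# Longest run among the non-NA runs (NA runs get length 0 here).
best <- which.max(run$lengths * run$values)

# All rows, all columns, of that run.
longest_run <- df[run_start[best]:run_end[best], ]
longest_run
```

For this data the longest run covers rows 4 to 9. When several runs tie for the longest, which.max() keeps the first one.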

Checking whether a value in table 1 column A is present in table 2 column E

I have table (table 1) which contains a column with names of companies with a certain permit. I have another table (table 2) which contains all the information of companies being active. Now I would like to check whether the companies listed in table 2 are present in table 1.
So basically I want to compare the values of table 2 column company name to the values of table 1 column company name. Something like v-lookup. How can I most easily execute this in R?
It is not hard to make up some data that illustrate your question. That is what the FAQ shows you how to do:
set.seed(42)
table1 <- data.frame(company=sample(LETTERS, 10))
table2 <- data.frame(company=LETTERS)
table1$company is a vector of the companies with permits and table2$company is a vector of all of the companies. Now use %in% to find which companies in table2 are in table1:
intable1 <- table2$company %in% table1$company
intable1 is a logical vector which is TRUE if the table2$company value is in table1$company. You can add this column to table2 as a logical vector or print the results:
table2[intable1, ]
# [1] A D E G J O Q R V Z
# Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
The company names are factors in this example (the default for data.frame() before R 4.0.0). If you want them to be character strings, use
table1 <- data.frame(company=sample(LETTERS, 10), stringsAsFactors=FALSE)
table2 <- data.frame(company=LETTERS, stringsAsFactors=FALSE)
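Adding the result to table2 as a column, as mentioned above, could look like the following sketch. The column name has_permit is invented here for illustration.

```r
set.seed(42)
table1 <- data.frame(company = sample(LETTERS, 10), stringsAsFactors = FALSE)
table2 <- data.frame(company = LETTERS, stringsAsFactors = FALSE)

# TRUE where the company in table2 also appears in table1.
table2$has_permit <- table2$company %in% table1$company

# Companies with a permit, and companies without one.
with_permit    <- table2[table2$has_permit, ]
without_permit <- table2[!table2$has_permit, ]

head(table2)
```

Because %in% returns one logical value per row of table2, the new column lines up with the existing rows without any merging or sorting.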

intersection with tolerance of non-equal vectors and ID

I have a question about matching values between two vectors.
Let's say I have a vector and a data frame:
data.frame               vector 2
value      name
154.0031   A             154.0084
154.0768   B             159.0344
154.2145   C             154.0755
154.4954   D             156.7758
156.7731   E
156.8399   F
159.0299   G
159.6555   H
159.9384   I
Now I want to compare vector 2 with values in the data frame with a defined global tolerance (e.g. +-0.005) that is adjustable and add the corresponding names to vector 2, so I get a result like this:
data.frame               vector 2   name
value      name
154.0031   A             154.0074   A
154.0768   B             159.0334   G
154.2145   C             154.0755   B
154.4954   D             156.7758   E
156.7731   E
156.8399   F
159.0299   G
159.6555   H
159.9384   I
I tried to use intersect() but there is no option for tolerance in it?
Many thanks!
This outcome can be achieved with outer, which, and subsetting.
# calculate distances between elements of each object
# rows are df and columns are vec 2
myDists <- outer(df$value, vec2, FUN=function(x, y) abs(x - y))
# keep the pairs whose distance is below the chosen tolerance
# using arr.ind =TRUE returns a matrix with the row and column positions
matches <- which(myDists < 0.05, arr.ind=TRUE)
data.frame(name = df$name[matches[, 1]], value=vec2[matches[, 2]])
name value
1 A 154.0084
2 G 159.0344
3 B 154.0755
4 E 156.7758
Note that this returns only the elements of vec2 that have a match, and it returns every row of df within the tolerance, so one element of vec2 can appear more than once.
To get exactly one row per element of vec2, use
# for each element of vec2, keep the first matching row of df (lowest row index)
closest <- tapply(matches[,1], list(matches[,2]), min)
# fill in the names.
# NA will appear where there are no obs that meet the threshold.
data.frame(name = df$name[closest][match(seq_along(vec2),
                                          as.integer(names(closest)))], value=vec2)
Currently, this returns the same result as above, but will return NAs where there is no adequate observation in df.
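The tapply step keeps the lowest matching row index, which is not necessarily the nearest value when several rows fall within the tolerance. A hedged alternative that picks the nearest value for each element of vec2 and returns NA when nothing is close enough is sketched below. The helper nearest_name, the tolerance of 0.01, and the extra value 160.5 (added only to show an unmatched case) are assumptions for illustration.

```r
df <- data.frame(
  value = c(154.0031, 154.0768, 154.2145, 154.4954, 156.7731,
            156.8399, 159.0299, 159.6555, 159.9384),
  name  = LETTERS[1:9],
  stringsAsFactors = FALSE
)
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758, 160.5)
tol  <- 0.01  # adjustable global tolerance

# Name of the nearest value, or NA if it lies outside the tolerance.
nearest_name <- function(v, values, labels, tol) {
  d <- abs(values - v)
  i <- which.min(d)
  if (d[i] <= tol) labels[i] else NA_character_
}

matched <- vapply(vec2,
                  function(v) nearest_name(v, df$value, df$name, tol),
                  character(1))

data.frame(value = vec2, name = matched)
#      value name
# 1 154.0084    A
# 2 159.0344    G
# 3 154.0755    B
# 4 156.7758    E
# 5 160.5000 <NA>
```

This loops over vec2 instead of building the full distance matrix, which trades a little speed for lower memory use on long vectors.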
data
Please provide reproducible data if you ask a question in the future. See below.
df <- read.table(header=TRUE, text="value name
154.0031 A
154.0768 B
154.2145 C
154.4954 D
156.7731 E
156.8399 F
159.0299 G
159.6555 H
159.9384 I")
vec2 <- c(154.0084, 159.0344, 154.0755, 156.7758)

R Compare non side-by-side duplicates in 2 columns

There are many similar questions but I'd like to compare 2 columns and delete all the duplicates in both columns so that all that is left is the unique observations in each column. Note: Duplicates are not side-by-side. If possible, I would also like a list of the duplicates (not just TRUE/FALSE). Thanks!
C1 C2
1 a z
2 c d
3 f a
4 e c
would become
C1 C2
1 f z
2 e d
with duplicate list
duplicates: a, c
Here is another answer, which looks for duplicates within each column.
where_dupe <- which(apply(df, 2, duplicated), arr.ind = TRUE)
This gives the row and column positions of the values that repeat an earlier value in the same column.
col_unique <- setdiff(seq_len(ncol(df)), where_dupe[, "col"])
This gives the indices of the columns that had no duplicates.
You can get their values by indexing:
df[, col_unique]
Here is a base R method using duplicated and lapply.
temp <- unlist(df)
# get duplicated elements
myDupeVec <- unique(temp[duplicated(temp)])
# get list without duplicates
noDupesList <- lapply(df, function(i) i[!(i %in% myDupeVec)])
noDupesList
$C1
[1] "f" "e"
$C2
[1] "z" "d"
data
df <- read.table(header=T, text=" C1 C2
1 a z
2 c d
3 f a
4 e c ", as.is=TRUE)
Note that this returns a list. A list is the more flexible structure here, because a value may be repeated a different number of times in each variable, so the columns can end up with different lengths once the duplicates are removed. If the lengths are equal, you can use do.call and data.frame to put the result into a rectangular structure.
do.call(data.frame, noDupesList)
C1 C2
1 f z
2 e d
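For exactly two columns, the list of shared values asked for in the question can also be obtained with intersect(), and the remaining values with setdiff(). This is only a sketch for the two-column case; note that setdiff() also drops values repeated within a single column, which may or may not be wanted.

```r
df <- data.frame(C1 = c("a", "c", "f", "e"),
                 C2 = c("z", "d", "a", "c"),
                 stringsAsFactors = FALSE)

# Values that occur in both columns.
dupes <- intersect(df$C1, df$C2)
dupes
# [1] "a" "c"

# Values left in each column once the shared ones are removed.
remaining <- list(C1 = setdiff(df$C1, dupes),
                  C2 = setdiff(df$C2, dupes))
remaining
```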

Count occurrence of unique variables

I have a data frame of variables, some occur more than once, e.g.:
a, b, b, b, c, c, d, e, f
I would then like to get an output (in two columns) like this:
a 1; b 3; c 2; d 1; e 1; f 1.
Bonus question: I'd like the variable to be named something (e.g. 'other' if less than 2 occurrences) if the variable is appearing less than 'n' times in the counted column.
Tabulating and collapsing
Your example vector is
vec <- letters[c(1,2,2,2,3,3,4,5,6)]
To get a tabulation, use
tab <- table(vec)
To collapse infrequent items (say, with counts below two), use
res <- c(tab[tab>=2],other=sum(tab[tab<2]))
# b c other
# 3 2 4
Displaying in two columns
resdf <- data.frame(count=res)
# count
# b 3
# c 2
# other 4
Technically, the "first column" here is the row labels, accessible with rownames(resdf).
Similar options include:
stack(res) for two actual columns
data.frame(count=sort(res,decreasing=TRUE)) to sort
In all of these, tab or c(tab) can be used in place of res.
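The tabulate-and-collapse steps can be wrapped in a small function with the threshold n as a parameter, which covers the bonus question. The function name count_with_other and its arguments are hypothetical, chosen here for illustration.

```r
vec <- letters[c(1, 2, 2, 2, 3, 3, 4, 5, 6)]

# Count each value; pool values seen fewer than n times under `label`.
count_with_other <- function(x, n = 2, label = "other") {
  tab    <- table(x)
  counts <- setNames(as.vector(tab), names(tab))
  rare   <- counts < n
  res    <- c(counts[!rare], sum(counts[rare]))
  names(res)[length(res)] <- label
  data.frame(value = names(res), count = as.vector(res),
             stringsAsFactors = FALSE)
}

out <- count_with_other(vec, n = 2)
out
#   value count
# 1     b     3
# 2     c     2
# 3 other     4
```

The pooled row is always present in this sketch, with a count of 0 when no value falls below the threshold.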
