Vlookup/Match function in R for continuous columns

I have two dataframes.
df1:
Dis1_SubDIs1_Village1       Dis2_SubDIs1_Village1  Dis1_SubDIs2_Village1
JODHPUR|JODHPUR|JODHPUR     |JODHPUR|JODHPUR       JODHPUR||JODHPUR
JHUNJHUNUN|JHUNJHUNUN|BARI  |JHUNJHUNUN|BARI       JHUNJHUNUN|BARI|BARI
BUNDI|HINDOLI|BUNDI         |HINDOLI|BUNDI         BUNDI|BUNDI|BUNDI
SIROHI|SIROHI|SIROHI        |SIROHI|SIROHI         SIROHI||SIROHI
ALWAR|ALWAR|BASAI           |ALWAR|BASAI           ALWAR||BASAI
BHARATPUR|BHARATPUR|SEEKRI  |BHARATPUR|SEEKRI      BHARATPUR||SEEKRI
and the second dataframe,
df2:
High
|BHARATPUR|SEEKRI
BUNDI|HINDOLI|BUNDI
SIROHI||SIROHI
CHURU|TARANAGAR|DABRI CHHOTI
Now I want to apply vlookup/match on df1 with respect to the df2 column, the same as we do in Excel.
If an exact match is there, give me the match, else 0.
I tried writing the function in R.
For match:
for(i in names(df1)){
  match_vector <- match(df1[, i], df2$High, incomparables = NA)
  df1$High <- df2$High[match_vector]
}
but I am getting an error: it shows the match only for the last column and overwrites the values of the other columns.
For vlookup:
func_vlook = function(a){
  for(i in 1:ncol(a)) {
    lookup_df = vlookup_df(lookup_value = i,
                           dict = df2,
                           lookup_column = 1)
  }
  return(lookup_df)
}
lookup_df <- func_vlook(a = df1)
Still getting an error.
My final output should look like this:
Dis1_SubDIs1_Village1_M1  Dis2_SubDIs1_Village1_M2  Dis1_SubDIs2_Village1_M3
NA                        NA                        NA
NA                        NA                        NA
BUNDI|HINDOLI|BUNDI       NA                        NA
NA                        SIROHI||SIROHI            SIROHI||SIROHI
NA                        NA                        NA
NA                        NA                        |BHARATPUR|SEEKRI
For N columns in the input, there should be N columns with the matches in the output.
Please help.

No need for any loops with this one: apply and match should work fine. apply will iterate over as many columns as you have, so the output will have the same number of columns as the input. In your example, apply will simplify the result to a matrix.
apply(X = df1,
      MARGIN = 2,
      FUN = function(x) df2$High[match(x, df2$High)])
If you need a data frame as the output, wrap the code above in as.data.frame():
as.data.frame(apply(X = df1,
                    MARGIN = 2,
                    FUN = function(x) df2$High[match(x, df2$High)]))
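For illustration, here is a minimal reproducible sketch built from two of the question's rows (the shortened column names and cut-down data are assumptions, not the original files):
df1 <- data.frame(Dis1 = c("BUNDI|HINDOLI|BUNDI", "SIROHI|SIROHI|SIROHI"),
                  Dis3 = c("BUNDI|BUNDI|BUNDI", "SIROHI||SIROHI"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(High = c("|BHARATPUR|SEEKRI", "BUNDI|HINDOLI|BUNDI", "SIROHI||SIROHI"),
                  stringsAsFactors = FALSE)
# each cell is kept when it has an exact match in df2$High, otherwise it becomes NA
as.data.frame(apply(df1, 2, function(x) df2$High[match(x, df2$High)]))
## Dis1 = c("BUNDI|HINDOLI|BUNDI", NA), Dis3 = c(NA, "SIROHI||SIROHI")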

Related

Why does my code throw an error when looped? It works when I increment the index "by hand", but fails when I put it in a loop

I want to append values from one dataframe as column names to another data frame.
I've written code that will produce one column at a time if I "manually" assign index values:
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"), wiggy = c("soar", "trot", "dive", "gallop"))
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"))
#create vector of categories
categroups <- as.character(unique(df_searchtable$category))
##### if I assign column names one at a time using index numbers, no problem:
group = categroups[1]
df_host[, group] <- NA
##### if I use a loop to assign the column names:
for (i in categroups) {
  group = categroups[i]
  df_host[, group] <- NA
}
the code fails, giving:
Error in `[<-.data.frame`(`*tmp*`, , group, value = NA) :
  missing values are not allowed in subscripted assignments of data frames
How can I get around this problem?
Here's a simple base R solution:
df_host[categroups] <- NA
df_host
            textcolum air ground
1   run on the ground  NA     NA
2 fly through the air  NA     NA
The problem with your loop is that you are looping through the elements of categroups themselves, whereas your code assumes you are looping through the indices 1, 2, ..., n.
For instance:
for (i in categroups) {
  print(i)
  print(categroups[i])
}
[1] "air"
[1] NA
[1] "ground"
[1] NA
To fix your loop, you could do one of two things:
for (group in categroups) {
  df_host[, group] <- NA
}
# or
for (i in seq_along(categroups)) {
  group <- categroups[i]
  df_host[, group] <- NA
}
Here's a solution using purrr's map.
# requires dplyr, purrr and tibble
bind_cols(df_host,
          map_dfc(categroups,
                  function(group) tibble(!!group := rep(NA_real_, nrow(df_host)))))
Gives:
            textcolum air ground
1   run on the ground  NA     NA
2 fly through the air  NA     NA
map_dfc maps over the input categroups, creates a single-column tibble for each one, and joins the newly created tibbles into a dataframe
bind_cols joins the original dataframe to your new tibble
Alternatively you could use walk:
walk(categroups, function(group){df_host <<- mutate(df_host, !!group := rep(NA_real_, nrow(df_host)))})
Here's an ugly base R solution: create an empty matrix with the column names and cbind it to the second dataframe.
df_searchtable <- data.frame(category = c("air", "ground", "ground", "air"),
                             wiggy = c("soar", "trot", "dive", "gallop"),
                             stringsAsFactors = FALSE)
df_host <- data.frame(textcolum = c("run on the ground", "fly through the air"),
                      stringsAsFactors = FALSE)
cbind(df_host,
      matrix(nrow = nrow(df_host),
             ncol = length(unique(df_searchtable$category)),
             dimnames = list(NULL, unique(df_searchtable$category))))
Result:
            textcolum air ground
1   run on the ground  NA     NA
2 fly through the air  NA     NA

How to create a new variable based on another variable's value?

I have a vector containing strings that represent names of variables that should be in my final df. Those names can change every time based on other conditions.
x <- colnames(df)
y <- c("blue", "yellow", "red")
z <- setdiff(y,x)
Let's say my result now is: z = c("blue", "red")
I would like a function that, for every element of vector y that is missing from df (i.e. every element of z), creates a column in df with that element as the variable name.
Here's my inconclusive attempt:
if (length(z) > 0) {
  for (i in z) {
    df$i <- NA
  }
}
The part I don't know how to do is pass i as an argument for creating a new variable on df.
In my example: I should finally get df$yellow as a new variable of df.
I checked many posts; either I don't understand how they work, or they don't do what I need. Some for reference:
Create new variables based on another df
Rename variable based on textInput value in Shiny
Executing a function with paste to create a new variable in a dataframe in R
Evaluate expression given as a string
This is one possibility without any loops:
df <- data.frame(x = 1:5)
z <- c("blue", "red")
df[z] <- NA_character_
  x blue red
1 1   NA  NA
2 2   NA  NA
3 3   NA  NA
4 4   NA  NA
5 5   NA  NA
The solution was indeed the simple suggestion from @akrun:
You can use [ instead of $, i.e. df[z] <- NA. Reproducible example: mtcars[z] <- NA; head(mtcars)
Hence, as follows:
if (length(z) > 0) {
  for (i in z) {
    df[i] <- NA
  }
}
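Putting the question's setup together with the [-based assignment, a minimal sketch (the example df is assumed):
df <- data.frame(blue = 1:3)     # only 'blue' exists so far
y <- c("blue", "yellow", "red")
z <- setdiff(y, colnames(df))    # names in y that are missing from df
if (length(z) > 0) df[z] <- NA   # create all missing columns at once, no loop needed
df
##   blue yellow red
## 1    1     NA  NA
## 2    2     NA  NA
## 3    3     NA  NA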

too many NA values in dataset for na.omit to handle

I have a text file dataset that I read as follows:
cancer1 <- read.table("cancer.txt", stringsAsFactors = FALSE, quote='', header=TRUE,sep='\t')
I then have to convert the class of the constituent values so that I can perform mathematical analyses on the df.
cancer<-apply(cancer1,2, as.numeric)
This introduces >9000 NA values into a 17980 x 598 df. Hence there are too many NA values to simply use na.omit, as that just removes all of the rows.
Hence my plan is to replace each NA with the mean value of its row. My attempt is as follows:
for(i in rownames(cancer)){
  cancer2 <- replace(cancer, is.na(cancer), mean(cancer[i,]))
}
However this removes every row just like na.omit:
dim(cancer2)
[1] 0 598
Can someone tell me how to replace each of the NA values with the mean of that row?
You can use rowMeans with indexing.
k <- which(is.na(cancer1), arr.ind=TRUE)
cancer1[k] <- rowMeans(cancer1, na.rm=TRUE)[k[,1]]
where k is a two-column matrix holding the row and column indices of the NA values.
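For illustration, here is the same idea on a small toy matrix (made-up numbers):
m <- rbind(c(1, 4, NA),
           c(NA, 5, 8),
           c(3, NA, 9))
k <- which(is.na(m), arr.ind = TRUE)       # row/column positions of the NAs
m[k] <- rowMeans(m, na.rm = TRUE)[k[, 1]]  # fill each NA with the mean of its row
m
##      [,1] [,2] [,3]
## [1,]  1.0  4.0  2.5
## [2,]  6.5  5.0  8.0
## [3,]  3.0  6.0  9.0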
This works better than my original answer, which was:
for(i in 1:nrow(cancer1)){
  for(n in 1:ncol(cancer1)){
    if(is.na(cancer1[i,n])){
      cancer1[i,n] <- mean(t(cancer1[i,]), na.rm = T) # or rowMeans(cancer1[i,], na.rm=T)
    }
  }
}
Sorted it out with code adapted from a related post:
cancer1 <- read.table("TCGA_BRCA_Agilent_244K_microarray_genomicMatrix.txt", stringsAsFactors = FALSE, quote='' ,header=TRUE,sep='\t')
t<-cancer1[1:800, 1:400]
t<-t(t)
t<-apply(t,2, as.numeric) #constituents read as character strings need to be converted
#to numerics
cM <- rowMeans(t, na.rm=TRUE) #necessary subsequent data cleaning due to the
#introduction of >1000 NA values- converted to the mean value of that row
indx <- which(is.na(t), arr.ind=TRUE)
t[indx] <- cM[indx[,2]]

Function which uses different dataframes with same column names

I have two dataframes which look as follows:
df1 <- data.frame(V1 = 1:4, V2 = rep(2, 4), V3 = 7:4)
df2 <- data.frame(V2 = rep(NA, 4), V1 = rep(NA, 4), V3 = rep(NA, 4))
I need to write a function which assigns the values of df1 to df2 if the column names of both dataframes are the same. The structure of the function should look like this:
fun <- function(x){
  if(# If the name of x is the same as the name of a column in df1)
    out <- df1$? # Here I need to assign df1$"x" somehow
  out
}
fun(df2$V1)
The output should look like this:
[1] 1 2 3 4
Unfortunately I couldn't find a solution by myself. Is there a way I could do this? Thank you very much in advance!
I need to write a function which assigns the values of df1 to df2, if
the columnnames of both dataframes are the same.
Are you sure you need a function?
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
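For reference, running these two lines on the question's df1 and df2 should give:
df2
##   V2 V1 V3
## 1  2  1  7
## 2  2  2  6
## 3  2  3  5
## 4  2  4  4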
Using Joachim Schork's code:
names_in_common <- intersect(names(df1),names(df2))
df2[,names_in_common] <- df1[,names_in_common]
and if you want to change a single column of df2:
names_in_common <- intersect(names(df1), names(df2[, "V1", drop=FALSE]))
df2[,names_in_common] <- df1[,names_in_common]
This is impossible, because when you access a column of a data.frame using the dollar syntax you lose the column name. There's no way for fun() to determine the column name of the vector that was passed in as an argument.
Instead, you can simply call fun() using the column name itself as the argument, rather than the vector of NAs, which are not useful and not used at all inside the function. In other words, the call becomes
fun('V1');
Then you can write the function as follows:
fun <- function(name) df1[[name]];
Demo:
fun('V1');
## [1] 1 2 3 4
Although now that I think about it, you might as well just index df1 directly, since that's all the function does now:
df1$V1;
## [1] 1 2 3 4
Rereading your question, you said you want to assign the column from df1 to df2, although your example code doesn't do that. Assuming you did want to carry out this assignment inside the function, you could do this:
fun <- function(name) df2[[name]] <<- df1[[name]];
Demo:
fun('V1');
df2;
## V2 V1 V3
## 1 NA 1 NA
## 2 NA 2 NA
## 3 NA 3 NA
## 4 NA 4 NA
This makes use of the superassignment operator <<-.

Creating a function to read a data set and columns and display nrow

I am struggling a bit with a probably fairly simple task. I wanted to create a function that takes as arguments a dataframe (df), column names of the dataframe (T and R), and values of the selected columns (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected. I'm getting an error.
fun <- function(df,T,a,R,b)
{
  col <- ds[c("x","y")]
  omit <- na.omit(col)
  data1 <- omit[omit$x == 'a',]
  data2 <- omit[omit$x == 'b',]
  nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to the R language, so please help me.
There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably, you want a and b to be arguments you passed to the function, but you specified 'a' and 'b' which are specific, not general arguments. Also, I assume that second "omit$x" should be "omit$y" (or vice versa). And actually, since you just made this into a new data frame with two columns, you can just use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one should suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it.
fun <- function(df, col1, val1, col2, val2){
  new_cols <- df[,c(col1, col2)]
  omit <- na.omit(new_cols)
  data1 <- omit[omit[,1] == val1,]
  data2 <- omit[omit[,2] == val2,]
  print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
  new_cols <- df[,c(col1, col2)]
  omit <- na.omit(new_cols)
  n1 <- sum(omit[,1] == val1)
  n2 <- sum(omit[,2] == val2)
  print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)
I would write this function as follows:
fun <- function(df,T,a,R,b) {
  data <- na.omit(df[c(T,R)]);
  sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333
