R: Track changes across two columns

I have a data frame which records the changes in the name of companies. A simple representation would be :
df <- data.frame(key = c("A", "B","C", "E","F","G"), Change = c("B", "C","D" ,"F","G","H"))
print(df)
key Change
1 A B
2 B C
3 C D
4 E F
5 F G
6 G H
I want to track all the changes a value is going through. Here is an output that can help me do so:
Key 1st 2nd 3rd 4th
1 A B C D
2 E F G H
How can I do it in R? I am new to R and Programming. It would be great to get help.
The question was marked as a duplicate of "How to reshape data from long to wide format?"
However, it is not an exact duplicate, for two reasons:
1. The example used here contains data that changes across columns: the two columns depend on each other. That is not the case in the reshaping question.
2. Before reshaping, I reckon there is another step, maybe assigning an id to each chain of changes. I am not sure how to do it.
Could you help me?

Can we assume that the same name never reappears (that is, a case like A->B->C and D->E->A never occurs)? If so, you can do the following.
df <- data.frame(key = c("A","B","C", "E","F","G"),
Change = c("B","C","D" ,"F","G","H"))
print(df)
# mapping from old to new name
next_name <- as.character(df$Change)
names(next_name) <- df$key
all_names <- unique(c(as.character(df$key), as.character(df$Change)))
get_id <- function(x) {
# for each name, repeatedly traverse until the final name
ss <- x %in% names(next_name)
if (any(ss)) {
x[ss] <- get_id(next_name[x[ss]])
}
x
}
ids <- get_id(all_names)
lapply(unique(ids), function(i) c(all_names[ids==i]))
# the outcome is a list of company names;
# each entry represents the name history of one firm
##[[1]]
##[1] "A" "B" "C" "D"
##[[2]]
##[1] "E" "F" "G" "H"
The outcome is a list, not a data frame, since the name sequences may not all have the same length (firms may have different numbers of names).
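If a rectangular result like the one sketched in the question is still wanted, the histories can be padded with NA. The following is a minimal base R sketch, not part of the original answer. The names histories, padded and wide and the change_ column labels are illustrative assumptions, and the second history is deliberately one name longer to show the padding.

```r
# Illustrative input: name histories of unequal length (made-up data)
histories <- list(c("A", "B", "C", "D"), c("E", "F", "G", "H", "I"))

# Pad every history with NA up to the length of the longest one
max_len <- max(lengths(histories))
padded <- lapply(histories, function(h) {
  c(h, rep(NA_character_, max_len - length(h)))
})

# Stack the padded vectors row-wise and label the columns
wide <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(wide) <- c("Key", paste0("change_", seq_len(max_len - 1)))
print(wide)
```

Shorter histories end with NA in the trailing columns, which keeps the "firms may have different numbers of names" case visible in the table.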

Related

Compare two columns in different tables

Say I have two tables, Table A and Table B, and I want to compare a certain column.
For example,
Table A has the columns:
Name, Surname, Family, Species
Table B has the columns:
IP, Genes, Types, Species, Models
How do I compare the Species column between the two tables to get the matches? That is, I want to extract the names of the species that exist in both tables.
For example, if the first Species column contains
a b c d e f g h i
and the second Species column contains
k l m n a b y i l
I want this result:
a b i
Can you please tell me how I can do that, and also whether there is any way to do it without using a join?
Thank you very much
Try any of these options. I have used dummy data:
# Data
TableA <- data.frame(Species = c('a','b','c','d','e','f','g','h','i'),
                     Var = 1, stringsAsFactors = FALSE)
TableB <- data.frame(Species = c('k','l','m','n','a','b','y','i','l'),
                     Var2 = 2, stringsAsFactors = FALSE)
# Option 1
TableA$Species[TableA$Species %in% TableB$Species]
# Option 2
intersect(TableA$Species, TableB$Species)
In both cases the output will be:
[1] "a" "b" "i"
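The two options agree here because no species is repeated in the first table. As a small illustration with made-up vectors (a and b are not from the question), they differ once the first vector contains duplicates: %in% keeps every matching element, while intersect returns each common value once.

```r
# Made-up vectors; "a" appears twice in the first one
a <- c("a", "b", "a", "c")
b <- c("a", "c", "z")

kept   <- a[a %in% b]      # every element of a that occurs in b, duplicates kept
common <- intersect(a, b)  # set intersection: each common value once

print(kept)
print(common)
```

Wrapping the first form in unique() makes it behave like intersect() for this purpose.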

How to assign the column name to the variable dynamically

I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the name of the next column, "b", should be assigned to a variable, say promote.
My code below throws the error: Error in `[.data.frame`(char_set, i + 1) : undefined columns selected. Is there any solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something along these lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
print(i)
if (names(char_set)[i] == char_name) {
res_dta[i] <- char_set[i + 1]
} else {
res_dta[i] <- char_set[i]
}
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
A few general points:
When you loop through columns, be careful not to go beyond the data frame's dimensions: with i = 4, i + 1 refers to column 5, which raises an error for a data frame with four columns. You can either stop one column early or break at a specific value of i.
I am not sure I understood your request: for column name a you want to take the values of column b, and column b then stays as it was?
Broadly speaking, I think the condition names(char_set)[i] == char_name needs more thought, but this answer gives you a starting point. Updating your post with the desired results would help in designing a solution.
The problem in your code is that you loop from 1 to the number of columns of char_set and then access char_set[i+1].
Thus, when i reaches its maximum value, char_set[i+1] returns an error because there is no column with that index.
You can try this solution:
char_name <- "a"
# position of the column that follows char_name
idx <- which(names(char_set) == char_name) + 1
# use <= so that the last column can still be returned as a successor
promote <- ifelse(idx <= ncol(char_set), names(char_set)[idx], "-")
promote
> [1] "b"
char_name <- "d"
idx <- which(names(char_set) == char_name) + 1
promote <- ifelse(idx <= ncol(char_set), names(char_set)[idx], "-")
promote
> [1] "-"
Here, when char_name is "a", promote takes the name of the column that comes right after column a in char_set.
I suggest you think about the case in which char_name is "d" and there is no column after d in char_set; the code above returns "-" for it.
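The boundary handling discussed above can also be wrapped in a small helper. This is only a sketch under assumptions: next_col_name and its default argument are hypothetical names, not part of either answer, and a name that is not found is treated the same way as the last column.

```r
# Empty data frame with the four columns from the question
char_set <- data.frame(a = character(), b = character(),
                       c = character(), d = character(),
                       stringsAsFactors = FALSE)

# Hypothetical helper: name of the column that follows `col`,
# or `default` when `col` is the last column or does not exist
next_col_name <- function(df, col, default = "-") {
  pos <- match(col, names(df))
  if (is.na(pos) || pos == ncol(df)) {
    return(default)
  }
  names(df)[pos + 1]
}

promote <- next_col_name(char_set, "a")
print(promote)
```

Because the lookup works on names(df) only, no loop over the columns is needed and the out-of-range index never occurs.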

What's the best way to add a specific string to all column names in a dataframe in R?

I am trying to train a model on data that has been converted from a document-term matrix to a data frame. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields. For example, the word hello can appear in both the positive and negative comment fields (and is thus represented as a column in my data frame), so in my model I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapply or lapply, but I haven't made any progress with them. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to prepend the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package; it renames the columns by reference, without copying your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
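Applied to the use case in the question, the same renaming can be done per field before combining the two tables. A minimal base R sketch, assuming two small made-up data frames pos and neg that stand in for the converted document-term matrices (the names pos, neg and features are illustrative):

```r
# Made-up word counts for the positive and negative comment fields
pos <- data.frame(hello = c(1, 0), great = c(0, 2))
neg <- data.frame(hello = c(0, 1), awful = c(3, 0))

# Tag each set of columns with its field; paste0 is vectorized,
# so no sapply or lapply is needed
names(pos) <- paste0("positive_", names(pos))
names(neg) <- paste0("negative_", names(neg))

# Combine column-wise; the two "hello" columns no longer clash
features <- cbind(pos, neg)
print(names(features))
```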

Dynamic merge in R

I have an example filter table, shown below, and a big source data table. I need to merge the two. If no column in a filter row contains ALL, all three columns are used for the merge (Tran=1001, Acct=1 and Co=a for the inner join with the data table). If one of them, e.g. Tran, is ALL, the remaining two columns are used (Acct=3 and Co=c). If two of them, e.g. Tran and Acct, are ALL, the remaining column is used (Co=b).
The real difficulty is that the number of columns is not known in advance.
Can anyone help me with this?
Tran Acct Co
1001 1 a
1002 ALL ALL
ALL ALL b
ALL 4 ALL
1003 2 ALL
ALL 3 c
1004 ALL d
You're going to have to write a series of conditional statements using if, else if and else. I'll use the %in% operator to check for this. The %in% operator returns a vector of logical values. The easiest way to explain it is through an example:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 3, 4, 5, 6)
> x %in% y
[1] FALSE TRUE TRUE TRUE TRUE
Notice that it returns FALSE for the first value as the value of 1 in x is not in y. You can do the same for the "ALL" value in your data set. I assume you are going row by row as you seemed to imply in your question. Let me know if you need to check the whole column first (you can use the any function for that case). Here is an example of your first condition:
# Assume that df is your data.frame of data.
for (i in seq_along(df$Tran)) {
  # TRUE only when none of the three columns is "ALL" in row i
  if (!("ALL" %in% df$Tran[i]) && !("ALL" %in% df$Acct[i]) && !("ALL" %in% df$Co[i])) {
    # Do your merge here
  }
  # if ( [Put your next condition here] ) {
  #   Do the appropriate merge for that condition
  # }
  # ...
}
Note that I used the "!" operator to negate whatever %in% returns, because you want the case where ALL is NOT in the row. I realize now that you could simply have written "ALL" != df$Tran[i] since you are going row by row, but %in% might be more useful if you end up checking the whole column.
Hope this helps!
Editing in a new method now that it's more clear what the need is. So we have to find the number of "ALL" values in each row and then merge a certain way depending on the number of them. There are a lot of methods, but here's one I like:
> test <- data.frame(a = "ALL", b = 2, c = "ALL", d = 3, e = "ALL")
> test
a b c d e
1 ALL 2 ALL 3 ALL
> table(test[1, ] == "ALL")["TRUE"]
TRUE
3
Basically, I'm looking at the first row and counting how many cells return TRUE when compared with the string "ALL". From here you can set conditionals on this number. To automate this over the entire data frame, put it in a for loop and replace the row index 1 with i or whatever your loop variable is.
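The same count can also be obtained for all rows at once, without a loop. A minimal sketch, assuming a character-only filter table; filter_tbl is a made-up three-row excerpt of the table in the question and n_all is an illustrative name:

```r
# Made-up excerpt of the filter table, all columns as character
filter_tbl <- data.frame(
  Tran = c("1001", "1002", "ALL"),
  Acct = c("1", "ALL", "ALL"),
  Co   = c("a", "ALL", "b"),
  stringsAsFactors = FALSE
)

# Comparing the data frame with "ALL" gives a logical matrix;
# rowSums counts the TRUE cells in each row
n_all <- rowSums(filter_tbl == "ALL")
print(n_all)
```

The result has one entry per filter row, so conditions such as n_all == 0 can be used to pick out the rows that need all columns for the merge.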
To find which columns in a row contain "ALL" (and, conversely, which do not), you can use grep on each row. Here's a short example:
> # Initializing a sample data frame.
> df <- data.frame(a = "1", b = "ALL", c = "ALL", d = "5", e = "ALL")
> print(df)
a b c d e
1 1 ALL ALL 5 ALL
>
> # Finding the column numbers that have "ALL" in it using grep.
> places <- grep("ALL", df[1, ])
> print(places)
[1] 2 3 5
>
> # Each number corresponds to the order of the columns in the data frame and can be returned as such.
> nameCols <- names(df)[places]
> print(nameCols)
[1] "b" "c" "e"
>
> # Likewise, you can find what columns did not have "ALL" in it by doing the opposite.
> nameColsNOT <- names(df)[-places]
> print(nameColsNOT)
[1] "a" "d"
Repeat this for each row of your data frame in a loop and use the conditional method I outlined above. Please note that this requires all of your columns to be of class "character", which I assume is already the case.
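Putting the pieces together, one possible way to do the merge without writing a separate condition per pattern is to let each filter row choose its own join columns. This is a sketch under assumptions, not the answer's own method: filter_tbl and source_tbl are made-up data, match_row is a hypothetical helper, all columns are character, and every filter row is assumed to have at least one column that is not "ALL".

```r
# Made-up filter rows and source data, all key columns as character
filter_tbl <- data.frame(
  Tran = c("1001", "ALL", "ALL"),
  Acct = c("1", "3", "ALL"),
  Co   = c("a", "c", "b"),
  stringsAsFactors = FALSE
)
source_tbl <- data.frame(
  Tran  = c("1001", "1002", "1005", "1006"),
  Acct  = c("1", "3", "3", "2"),
  Co    = c("a", "c", "c", "b"),
  value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Hypothetical helper: inner join of the source data with filter row i,
# using only the columns of that row that are not "ALL"
match_row <- function(i) {
  keys <- names(filter_tbl)[unlist(filter_tbl[i, ]) != "ALL"]
  merge(source_tbl, filter_tbl[i, keys, drop = FALSE], by = keys)
}

# Run every filter row, stack the matches, and drop rows matched twice
matches <- lapply(seq_len(nrow(filter_tbl)), match_row)
result <- unique(do.call(rbind, matches))
print(result)
```

Because the key columns are derived from names(filter_tbl), the same code works when the filter table has more or fewer columns, which was the uncertainty raised in the question.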

concatenating strings to make variable name

I want to change the name of the output of my R function to reflect different strings that are inputted. Here is what I have tried:
kd = c("a","b","d","e","b")
test = function(kd){
return(list(assign(paste(kd,"burst",sep="_"),1:6)))
}
This is just a simple test function. I get the warning (which is just as bad as an error for me):
Warning message:
In assign(paste(kd, "burst", sep = "_"), 1:6) :
only the first element is used as variable name
Ideally I would get output like a_burst = 1, b_burst = 2 and so on, but I am not getting close.
I would like to split up a data frame by the contents of a vector and be able to name everything according to the names in that vector, similar to
How to split a data frame by rows, and then process the blocks?
but not quite. The naming is imperative.
Something like this, maybe?
kd = c("a","b","d","e","b")
test <- function(x){
l <- as.list(1:5)
names(l) <- paste(x,"burst",sep = "_")
l
}
test(kd)
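For the data frame case mentioned in the question, the same naming idea can be combined with split. A minimal sketch with made-up data; dat, value and bursts are illustrative names, and the grouping vector is stored as a column of the data frame:

```r
# Made-up data frame; kd is the grouping vector from the question
dat <- data.frame(kd = c("a", "b", "d", "e", "b"),
                  value = 1:5,
                  stringsAsFactors = FALSE)

# split returns a named list with one data frame per distinct value of kd
bursts <- split(dat, dat$kd)

# Rename the list elements instead of creating separate variables
names(bursts) <- paste(names(bursts), "burst", sep = "_")
print(names(bursts))
print(bursts$b_burst)
```

Keeping the pieces in a named list avoids assign() altogether; each block is then available as bursts$a_burst, bursts$b_burst and so on.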
You could use a vector instead of a list by way of setNames:
t1_6 <- setNames( 1:6, kd)
t1_6
a b d e b <NA>
1 2 3 4 5 6
> t1_6["a"]
a
1
Looking at the question again, I wondered if you wanted to assign sequential names to a character vector:
> a1_5 <- setNames(kd, paste0("alpha", 1:5))
> a1_5
alpha1 alpha2 alpha3 alpha4 alpha5
"a" "b" "d" "e" "b"
