Subtraction of rows from data table in R

I'm new to this site (and new to R) so I hope this is the right way to approach my problem.
I searched at this site but couldn't find the answer I'm looking for.
My problem is the following:
I have imported a table from a database into R (it says it's a data frame) and I want to subtract the values in a particular column (row by row). Thereafter, I'd like to assign these differences to a new column called 'Difference' in the same data frame.
Could anyone please tell me how to do this?
Many thanks,
Arjan

To add a new column, just assign to df$newcol, where df is the name of your data frame and newcol is the name you want; in this case it would be "Difference". If you want to subtract one existing column from another, just use arithmetic operations.
df$Difference <- df$col1 - df$col2
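If "row by row" instead means the difference between each row and the previous row within a single column, a minimal sketch using base R's diff() could look like this (the column name value is hypothetical):
# diff() returns length(x) - 1 consecutive differences, so pad the front
# with NA to keep the new column as long as the data frame
df$Difference <- c(NA, diff(df$value))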

I'm going to assume you want to subtract the values in one column from another; is this correct? This can be done pretty easily; see the code below.
First, I'm just going to make up some data.
df <- data.frame(v1 = rnorm(10,100,4), v2 = rnorm(10,25,4))
You can subtract the values in one column from another by doing just that.
Use $ to specify columns; adding a new name after the $ will create a new column (see the code below).
df$Differences <- df$v1 - df$v2
df
v1 v2 Differences
1 98.63754 29.54652 69.09102
2 99.49724 24.27766 75.21958
3 102.73056 25.01621 77.71435
4 100.87495 26.92563 73.94933
5 103.01357 17.46149 85.55208
6 97.24901 20.82983 76.41917
7 100.73915 27.95460 72.78454
8 98.14175 24.19351 73.94824
9 102.63738 21.74604 80.89133
10 105.78443 16.79960 88.98483
Hope this helps

Related

In R, match data from a string variable across two data frames, and when match is found, merge corresponding rows

I have two data frames, df1 (4x4) and df2 (4x1). In each, the first variable (i.e. Original_items and Reordered) is a string. In df1, V2:V4 are numeric. You can see that in df1 and df2, the data in the first variable is arranged in a different order. I need to do the following.
Take the 1st element of the df2 'Reordered' variable (i.e. "Enjoy holidays."), then search through the elements of the df1 'Original_items' variable to find the exact match.
When the match is found, I need to take the entire row of data associated with the matched element in df1 'Original_items' (i.e. "Enjoy holidays.", 4, 1, 3) and append it beside the same element of the df2 'Reordered' variable (i.e. "Enjoy holidays."). I need this output in a new data frame, called df_desired, which should be: "Enjoy holidays.", "Enjoy holidays.", 4, 1, 3. Please see the illustration of this example below.
When this is done, I would like to repeat this process for each element of the df2 'Reordered' variable, so the final result looks like the df_desired table below.
Context of the problem: I have around 2,000 items and 1,000 data points associated with each item. As I need to match items and append data in a predefined way, I am trying to think of an efficient solution.
EDIT
It was suggested that I could simply rename items in the "Original Variable". While this is true, it is inconvenient to do for a data frame of more than 2,000 items.
Also, it was mentioned that this question may be only about merging. I believe merging is needed here only for elements that have been identified as identical across df1 and df2. Therefore, there are two key questions: 1) how to match string variables in this particular case? 2) how to merge/append rows conditionally, i.e. if they have been matched. Thank you for your input; I would be grateful for your help.
I will mention what I tried and figured out so far. I realised that
df1[, 1] == df2[, 1]
gives me TRUE or FALSE for whether the rows in column 1 are the same in both data frames. I tried to set up a double loop, but unsuccessfully:
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (i == j) {
      c <- merge(a, b)
    } else {
      print("no result")
    }
  }
}
I feel that in the loop I'm not able to specify that I am only working with row values from the single variable 'Original_items' in df1.
# df1 (4x4 matrix)
Original_items V2 V3 V4
Love birds. 1 5 3
Eat a lot of food. 2 5 5
Love birthdays. 2 2 4
Enjoy holidays. 4 1 3
# df2 (4x1 matrix)
Reordered
Enjoy holidays.
Eat a lot of food.
Love birds.
Love birthdays.
# df_desired (4x5 matrix)
Reordered Original_items V2 V3 V4
Enjoy holidays. Enjoy holidays. 4 1 3
Eat a lot of food. Eat a lot of food. 2 5 5
Love birds. Love birds. 1 5 3
Love birthdays. Love birthdays. 2 2 4
If I understand correctly, you first want to sort df1$Original_items to be in the same order as df2$Reordered, then apply that same sorting pattern to the rest of the df1 variables.
First, get the vector of row indices of df1 in the sequential order that you want those rows of df1 to end up in.
# initialize an object to capture the indices
indices <- NULL
for (i in 1:nrow(df1)) {
  indices[i] <- which(df1$Original_items == df2$Reordered[i])
}
Then just use this vector of indices to reorder all the rows of df1 and create the new data frame.
df_desired <- cbind(df2$Reordered, df1[indices, ])
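For a table of the size mentioned in the question (around 2,000 items), a vectorized alternative avoids the loop entirely. A minimal sketch using base R's match(), assuming the strings match exactly:
# match() gives, for each element of df2$Reordered, the row index of the
# identical string in df1$Original_items (NA where there is no exact match)
idx <- match(df2$Reordered, df1$Original_items)
df_desired <- cbind(df2, df1[idx, ])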

R: remove rows that don't have the same value in two columns

Sorry for asking about something that should be an easy job; I am a geology student trying to use R for schoolwork.
I'd like to remove the rows from my database where the values in two particular columns do not match.
example:
e F 14 14
t D 14 12
j A 11 11
a R 14 13
So the second row should be removed, and the fourth as well. The column with the letters is not relevant, just the two with the numbers.
Suppose your data is stored in df; then do the following:
df <- data.frame(col1 = c('e','t','j','a'),
                 col2 = c('F','D','A','R'),
                 col3 = c(14,14,11,14),
                 col4 = c(14,12,11,13))
df <- df[df$col3 == df$col4, ]
Simple subset operation:
new_df <- subset(df, columnX == columnY)
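Here columnX and columnY are placeholders; with the example data frame built above, this would be:
new_df <- subset(df, col3 == col4)  # keep only rows where the two numeric columns agree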
So assume the rows that you want to remove are 2 and 4, as in the example above.
The key idea is you form a set of the rows you want to remove, and keep the complement of that set.
In R, the complement of a set is given by the '-' operator.
So, assuming the data.frame is called myData:
myData <- myData[-c(2, 4), ]

R column numbers

Have been working with many different datasets lately and need a quick way to identify the column number of different columns. For example, I have a dataset that has 75 variables (or columns). The variables that I need to use are in the middle of the dataset; I know the names of these variables, i.e. g, h, i, j, and k. Rather than writing the names of these variables each time I want to use, change, or reference them, I usually use the column numbers, i.e.
for (i in 35:39) { do bla bla bla}
The usual way I find the column number is to look at the data frame and count the columns until I get to the one I want, then count how many of them there are to get my 35:39. Is there a better way to do this? Is there a better way to find out that column/variable g is column number 35 and column/variable k is #39?
Just an expanded version of my comment. As I've said, there are several ways to do this; I do not think the one right way exists. Here is a possible solution (if I understand what you want to achieve, of course).
as.data.frame(cbind(column = 1:ncol(iris), names = names(iris)))
column names
1 1 Sepal.Length
2 2 Sepal.Width
3 3 Petal.Length
4 4 Petal.Width
5 5 Species
This way you know which name corresponds to which column.
If you want to see which column is named g you could do
which(names(mydataframe) == 'g')
which gives you the index of the column with name "g".
You can use match instead of which, as you need just one column match (which I suppose would be faster as well).
match('g',names(mydataframe))
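To recover the 35:39-style range from the question directly, a small sketch building on the same idea (it assumes the columns g through k are adjacent in the data frame):
# first and last column positions by name, expanded into a numeric range
cols <- match('g', names(mydataframe)):match('k', names(mydataframe))
cols  # e.g. 35:39, usable anywhere a hand-counted range was used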

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # a list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of step 1 is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector; in this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape (melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic, but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and returns as much information as possible to add back into the dataset. Once all the data is there, running it one last time on the first variable gives you the result.
# combine all into one data frame
require(gtools)
d <- smartbind(d1, d2, d3)
# function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))], 1)
  ret <- ifelse(is.null(ret), NA, ret)
  return(ret)
}
# function to get max info based on one variable
runiteration <- function(dataset, variable){
  require(plyr)
  e <- ddply(.data = dataset, .variables = variable,
             .fun = function(x){ apply(X = x, MARGIN = 2, FUN = getfirstnonna) })
  # returns the above without the NA "factor"
  return(e[which(!is.na(e[, variable])), ])
}
# run through all variables
for (i in 1:length(names(d))) {
  d <- rbind(d, runiteration(d, names(d)[i]))
}
# repeat first variable since all possible info should be available in dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, then this will return whichever non-NA value is on top. In that case, changing the row order in d or changing the variable order could affect the result. Changing the getfirstnonna function will alter this behaviour (tail would pick the last value instead, or you could even collect all possibilities). You could order the dataset from the most complete records to the least, as sketched below.
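As a sketch of that last suggestion (base R, not part of the original answer), the rows of d can be sorted from most to least complete before iterating:
# count the non-NA cells in each row and sort in decreasing order, so
# getfirstnonna() encounters the most complete records first
d <- d[order(rowSums(!is.na(d)), decreasing = TRUE), ]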
