Dividing one column in my dataset by a fixed value in R - r

I'm pretty new to R and here's a (maybe simple) question:
I have big .dat datasets and I add together two of them to get the sum of the values. The datasets kinda look like this:
#stud1
AMR X1 X2 X3...
1 3 4 10
2 4 5 2
#stud2
AMR X1 X2 X3
1 6 4 6
2 1 2 1
So what I did is
> studAll <- stud1 + stud2
and the result was:
# studAll:
AMR X1 X2 X3
2 9 8 16
4 5 7 3
MY PROBLEM NOW IS:
The AMR column is not meant to change, so my idea was to divide this column by 2 to get back the original values. Or is there an easier solution than my idea?

If I understand your question correctly you want to make a new dataframe which adds all the columns except AMR?
You could do it the long way:
studAll$X1 <- stud1$X1 + stud2$X1
and repeat for each X column.
Or this would work if the AMR column is identical across all the stud data frames:
#set up
stud1 =data.frame(c(1, 2), c(3,4),c(4,5),c(10,2))
stud2 <- stud1
cols <- (c("AMR", "X1", "X2", "X3"))
colnames(stud1) <- cols
colnames(stud2) <- cols
#add them
studAll = stud1 + stud2
#replace the AMR column into studAll from stud1
#this assumes the AMR column is the same in all studs
studAll$AMR <- stud1$AMR
You could also select all columns other than AMR and add them
See for example here http://www.r-tutor.com/r-introduction/data-frame
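The last suggestion (select every column other than AMR and add only those) could look roughly like the sketch below. It rebuilds the two small tables from the question; `value_cols` is just an illustrative name, and the sketch assumes both data frames have the same columns and the same row order.

```r
# Example data taken from the question
stud1 <- data.frame(AMR = c(1, 2), X1 = c(3, 4), X2 = c(4, 5), X3 = c(10, 2))
stud2 <- data.frame(AMR = c(1, 2), X1 = c(6, 1), X2 = c(4, 2), X3 = c(6, 1))

# Every column name except the identifier column
value_cols <- setdiff(names(stud1), "AMR")

# Start from stud1 so AMR is carried over unchanged,
# then overwrite only the value columns with the element-wise sum
studAll <- stud1
studAll[value_cols] <- stud1[value_cols] + stud2[value_cols]

print(studAll)
```

If the sum has already been computed over the whole data frame, the idea from the question also works: studAll$AMR <- studAll$AMR / 2 restores the column, as long as AMR really is identical in both inputs.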

Related

call variables by name and column number in a data.frame

I have a data frame with columns I want to reorder. However, in different iterations of my script, the total number of columns may change.
>Fruit
Vendor A B C D E ... Apples Oranges
Otto 4 5 2 5 2 ... 3 4
Fruit2<-Fruit[c(32,33,2:5)]
So instead of manually adapting the code (the columns 32 and 33 change) I'd like to do the following:
Fruit2<-Fruit[,c("Apples", "Oranges", 2:5)]
I tried a couple of syntaxes but could not get it to do what I want. I know, this is a simple syntax issue, but I could not find the solution yet.
The idea is to mix the variable name with the vector to reference the columns when writing a new data frame. I don't want to spell out the whole vector in variable names because in reality it's 30 variables.
I'm not sure how your data is stored in R, so this is what I used:
Fruit <- data.frame( "X1" = c("A",4),"X2" = c("B",5),"X3" = c("C",2),"X4"=
c("D",5),"X5"= c("E",2),"X6" = c("Apples",3),"X7"=
c("Oranges",4),row.names = c("Vendor","Otto"),stringsAsFactors = FALSE)
X1 X2 X3 X4 X5 X6 X7
Vendor A B C D E Apples Oranges
Otto 4 5 2 5 2 3 4
Then use:
indexes <- which(Fruit[1,]%in%c("Apples","Oranges"))
Fruit2<- Fruit[,c(indexes,2:5)]
Fruit[1,] references the Vendor row, and "%in%" returns a logical vector to the function "which". Then "which" returns indexes.
This gives:
> Fruit2
X6 X7 X2 X3 X4 X5
Vendor Apples Oranges B C D E
Otto 3 4 5 2 5 2
Make sure your data are not being stored as factors, otherwise this will not work. Or you could change the Vendor row to column names as per the comment above.
The answer is, as I found out, use the dplyr package.
It is very powerful.
The solution to the aforementioned problem would be:
Fruit2<-Fruit %>% select(Apples,Oranges,A:E)
This allows dynamic selection of columns and lists of columns even if the indexes of the columns change.
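For completeness, the same selection can probably be done in base R as well. The attempt in the question fails because c("Apples", "Oranges", 2:5) coerces everything to character, so "2" is looked up as a column name. One way around that, sketched here on a made-up one-row data frame (the `Pears` column and the helper names `by_name` and `by_pos` are only for illustration), is to translate the positions into names before combining:

```r
# Hypothetical stand-in for the Fruit data frame
Fruit <- data.frame(Vendor = "Otto", A = 4, B = 5, C = 2, D = 5, E = 2,
                    Pears = 1, Apples = 3, Oranges = 4)

by_name <- c("Apples", "Oranges")  # columns whose position may change
by_pos  <- 2:5                     # columns whose position is fixed

# Convert positions to names so that both parts are character vectors
Fruit2 <- Fruit[, c(by_name, names(Fruit)[by_pos])]

print(names(Fruit2))
```

The reverse also works: match(by_name, names(Fruit)) turns the names into positions, which can then be combined with 2:5.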

R: change one value every row in big dataframe

I just started working with R for my master's thesis and up to now all my calculations worked out, as I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write more sophisticated code and I can't find a way to do this.
Here's the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the value of the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID' variable is the same.
As this is the third day I'm trying to solve this, I already found and tested for and apply, but both ran for hours (I canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get along with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is character now, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, just join the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
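Applied to the loop from the question, the aggregate-then-join idea might look like the following base R sketch. The data are tiny made-up stand-ins for Sub02 and Sub03, the helper names `contrib`, `agg`, `idx` and `extra` are illustrative, and the sketch assumes that Source should match ID exactly (the loop in the question uses grepl, which also accepts partial matches).

```r
# Hypothetical miniature versions of the two sub-data-sets
Sub02 <- data.frame(ID = c("a", "b", "c"), VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("a", "a", "b"),
                    VAL1 = c("4", "6", "5"), VAL2 = c("1", "7", "8"),
                    stringsAsFactors = FALSE)

# Contribution of each Sub03 row: 1 (it counts as one entry) + VAL1 + VAL2
Sub03$contrib <- 1 + as.numeric(Sub03$VAL1) + as.numeric(Sub03$VAL2)

# One total per Source, computed once instead of once per loop iteration
agg <- aggregate(contrib ~ Source, data = Sub03, FUN = sum)

# Look up the total for every ID in Sub02; IDs without a match add 0
idx <- match(Sub02$ID, agg$Source)
extra <- agg$contrib[idx]
extra[is.na(extra)] <- 0

Sub02$VAL9 <- Sub02$VAL9 + extra
print(Sub02)
```

Because the grouping and the lookup are each done once over the whole column, this should avoid the repeated scan of Sub03 that makes the loop slow.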

rowSums using an indirect variable (i.e. using a string variable to allocate the column numbers)

I'm still pretty much a newbie in R but enjoying the journey so far. I'm trying to group weekly columns together into quarters, and try to create a more elegant solution rather than creating separate lines to assign values.
So I have created a list of values to contain the column ranges, e.g. Q1 <- 5:9, Q2 <- 10:22, and so forth. After reading the original data frame, I want to create a new one that has Q1 as the variable, and contains the total of column 5-9, Q2 with the total of 10:22, etc. The problem is, rowSums doesn't like me using a variable to denote the actual range.
This is what I am trying to achieve, with sval containing the original weekly data, and qsval, containing the quarterly totals:
Q110 <- 5:9
Q210 <- 10:22
Q310 <- 23:35
Q410 <- 36:48
Q111 <- 49:61
Q211 <- 62:74
Q311 <- 75:87
Q411 <- 88:100
qsval <- sval[,c(1:4)] # Copying the first four columns from the weekly data
period <- c('Q110','Q210','Q310','Q410','Q111','Q211','Q311','Q411')
for (i in 1:8) {
assign(qsval$period[i], rowSums(sval,na.rm=F, get(period[i])))
}
Is this possible at all? The error message given is:
Error in rowSums(sval, na.rm = F, get(period[i])) : invalid 'dims'
Any advice would be much appreciated! Thank you.
In the absence of reproducible data, here's an example which hopefully you can adapt to your specific case:
set.seed(1) # just to make the random data reproducible
sval <- data.frame(replicate(6,sample(1:3)))
# X1 X2 X3 X4 X5 X6
#1 1 3 3 1 3 2
#2 3 1 2 3 1 3
#3 2 2 1 2 2 1
Qlist <- list(Q1=1:3,Q2=4:6)
qsval <- data.frame(lapply(Qlist, function(x) rowSums(sval[x]) ))
# Q1 Q2
#1 7 6
#2 6 7
#3 5 5
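As for the error in the question: the third positional argument of rowSums is dims, so the column range passed there is not used to pick columns. Selecting the columns first, as above, avoids that. A sketch closer to the layout in the question, keeping the leading identifier columns, might look like this (the data and the column names `id`, `grp`, `w1` to `w6` are invented, with two identifier columns instead of four to keep it short):

```r
# Hypothetical weekly data: 2 identifier columns followed by 6 weekly columns
sval <- data.frame(id = c("r1", "r2"), grp = c("g1", "g2"),
                   w1 = c(1, 2), w2 = c(3, 4), w3 = c(5, 6),
                   w4 = c(7, 8), w5 = c(9, 10), w6 = c(11, 12))

# Named list of column ranges, one entry per quarter
quarters <- list(Q110 = 3:5, Q210 = 6:8)

# Subset the columns of each quarter, then sum across each row
totals <- lapply(quarters, function(cols) {
  rowSums(sval[, cols, drop = FALSE], na.rm = FALSE)
})

# Identifier columns plus one total column per quarter
qsval <- cbind(sval[, 1:2], as.data.frame(totals))
print(qsval)
```

Keeping the ranges in a named list means there is no need for assign or get; the list names become the column names of the result.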

Append values from column 2 to values from column 1

In R, I have two data frames (A and B) that share columns (1, 2 and 3). Column 1 has a unique identifier, and is the same for each data frame; columns 2 and 3 have different information. I'm trying to merge these two data frames to get 1 new data frame that has columns 1, 2, and 3, and in which the values in column 2 and 3 are concatenated: i.e. column 2 of the new data frame contains: [data frame A column 2 + data frame B column 2]
Example:
dfA <- data.frame(Name = c("John","James","Peter"),
Score = c(2,4,0),
Response = c("1,0,0,1","1,1,1,1","0,0,0,0"))
dfB <- data.frame(Name = c("John","James","Peter"),
Score = c(3,1,4),
Response = c("0,1,1,1","0,1,0,0","1,1,1,1"))
dfA:
Name Score Response
1 John 2 1,0,0,1
2 James 4 1,1,1,1
3 Peter 0 0,0,0,0
dfB:
Name Score Response
1 John 3 0,1,1,1
2 James 1 0,1,0,0
3 Peter 4 1,1,1,1
Should result in:
dfNew <- data.frame(Name = c("John","James","Peter"),
Score = c(5,5,4),
Response = c("1,0,0,1,0,1,1,1","1,1,1,1,0,1,0,0","0,0,0,0,1,1,1,1"))
dfNew:
Name Score Response
1 John 5 1,0,0,1,0,1,1,1
2 James 5 1,1,1,1,0,1,0,0
3 Peter 4 0,0,0,0,1,1,1,1
I've tried merge but that simply appends the columns (much like cbind)
Is there a way to do this, without having to cycle through all columns, like:
colnames(dfNew) <- c("Name","Score","Response")
dfNew$Score <- dfA$Score + dfB$Score
dfNew$Response <- paste(dfA$Response, dfB$Response, sep=",")
The added difficulty is, as you might have noticed, that for some columns we need to use addition, whereas others require concatenation separated by a comma (the columns requiring addition are formatted as numerical, the others as text, which might make it easier?)
Thanks in advance!
PS. The string 1,0,0,1,0,1,1,1 etc. captures the response per trial – this example has 8 trials to which participants can either respond correctly (1) or incorrectly (0); the final score is collected under Score. Just to explain why my data/example looks the way it does.
Personally, I would try to avoid concatenating 'response per trial' to a single variable ('Response') from the start, in order to make the data less static and facilitate any subsequent steps of analysis or data management. Given that the individual trials already are concatenated, as in your example, I would therefore consider splitting them up. Formatting the data frame for a final, pretty, printed output I would consider a different, later issue.
# merge data (cbind would also work if data are ordered properly)
df <- merge(x = dfA[ , c("Name", "Response")], y = dfB[ , c("Name", "Response")],
by = "Name")
# rename
names(df) <- c("Name", c("A", "B"))
# split concatenated columns
library(splitstackshape)
df2 <- concat.split.multiple(data = df, split.cols = c("A", "B"),
seps = ",", direction = "wide")
# calculate score
df2$Score <- rowSums(df2[ , -1])
df2
# Name A_1 A_2 A_3 A_4 B_1 B_2 B_3 B_4 Score
# 1 James 1 1 1 1 0 1 0 0 5
# 2 John 1 0 0 1 0 1 1 1 5
# 3 Peter 0 0 0 0 1 1 1 1 4
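If installing an extra package is not an option, the splitting step can probably be done in base R too. The sketch below uses the Response strings of dfA from the question and assumes every string contains the same number of trials; `resp` and `trial_matrix` are illustrative names.

```r
# Response strings of dfA from the question
resp <- c("1,0,0,1", "1,1,1,1", "0,0,0,0")

# Split each string at the commas and convert the pieces to numbers
pieces <- lapply(strsplit(resp, ",", fixed = TRUE), as.numeric)

# Stack the per-person vectors into a matrix: one row per person,
# one column per trial (requires equal lengths)
trial_matrix <- do.call(rbind, pieces)

# The row sums reproduce the Score column of dfA
print(rowSums(trial_matrix))
```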
I would approach this with a for loop over the column names you want to merge. Given your example data:
cols <- c("Score", "Response")
dfNew <- dfA[,"Name",drop=FALSE]
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfA[[n]] + dfB[[n]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfA[[n]], dfB[[n]], sep=",")})
}
This solution is basically your idea, but with a loop. Each column is checked to see whether it is numeric (add the values) or a string or factor (concatenate the strings). You could get a similar result with two vectors of names, one for the numeric columns and one for the character columns, but this approach is extensible if you have other data types as well (though I don't know what they might be). The major drawback of this method is that it assumes the data frames are in the same order with regard to Name. The next solution doesn't make that assumption:
dfNew <- merge(dfA, dfB, by="Name")
for (n in cols) {
switch(class(dfA[[n]]),
"numeric" = {dfNew[[n]] <- dfNew[[paste0(n,".x")]] + dfNew[[paste0(n,".y")]]},
"factor"=, "character" = {dfNew[[n]] <- paste(dfNew[[paste0(n,".x")]], dfNew[[paste0(n,".y")]], sep=",")})
dfNew[[paste0(n,".x")]] <- NULL
dfNew[[paste0(n,".y")]] <- NULL
}
Same general idea as the previous one, but it uses merge to make sure that the data are correctly aligned, and then works on the columns within dfNew (whose names are suffixed with ".x" and ".y"). Additional steps are included to get rid of the separate columns after joining. It also has the bonus feature of carrying along any other columns not listed in cols.

Combine two data.frames in R with differing rows

I have two tables one with more rows than the other. I would like to filter the rows out that both tables share. I tried the solutions proposed here.
The problem, however, is that it is a large data-set and computation takes quite a while. Is there any simple solution? I know how to extract the shared rows of both tables using:
rownames(x1)->k
rownames(x)->l
which(rownames(x1)%in%l)->o
Here x1 and x are my data frames. But this only provides me with the shared rows. How can I get the unique rows of each table to then exclude them respectively? So that I can just cbind both tables together?
(I edited the whole answer.)
You can merge both data frames with merge() (from Andrie's comment). Also check ?merge to see all the options for the by parameter; by = 0 (or by = "row.names") merges on the row names.
The code below shows an example with what could be your data frames (different number of rows and columns)
x = data.frame(a1 = c(1,1,1,1,1), a2 = c(0,1,1,0,0), a3 = c(1,0,2,0,0), row.names = c('y1','y2','y3','y4','y5'))
x1 = data.frame(a4 = c(1,1,1,1), a5 = c(0,1,0,0), row.names = c('y1','y3','y4','y5'))
Provided that the row names can be used as an identifier, we add them as a new column and merge on that column:
x$id <- row.names(x)
x1$id <- row.names(x1)
# merge by column names
merge(x, x1, by = intersect(names(x), names(x1)))
# result
# id a1 a2 a3 a4 a5
# 1 y1 1 0 1 1 0
# 2 y3 1 1 2 1 1
# 3 y4 1 0 0 1 0
# 4 y5 1 0 0 1 0
I hope this solves the problem.
EDIT: Ok, now I feel silly. If ALL columns have different names in both data frames then you don't need to put the row name as another column. Just use:
merge(x,x1, by=0)
If you only want the rows which are not repeated from each data set:
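k <- rownames(x1)
l <- rownames(x)
o <- which(k %in% l)  # positions in x1 of the rows that also occur in x
x1.uniq <- x1[!(k %in% l), , drop = FALSE]  # rows that occur only in x1
x.uniq <- x[!(l %in% k), , drop = FALSE]  # rows that occur only in x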
And then you can join them with rbind (this requires both data frames to have the same columns):
x2 <- rbind(x1.uniq,x.uniq);
If you also wanted the repeated rows you can add them:
x.repeated <- x1[o, , drop = FALSE]
x2 <- rbind(x2, x.repeated)
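To answer the part of the question about finding the rows unique to each table before a cbind, a small sketch with set operations on the row names might help. The data frames here are reduced, made-up versions of x and x1, and `shared`, `only_x` and `only_x1` are illustrative names.

```r
# Hypothetical small data frames identified by their row names
x  <- data.frame(a1 = c(1, 1, 1), row.names = c("y1", "y2", "y3"))
x1 <- data.frame(a4 = c(5, 6), row.names = c("y1", "y3"))

shared  <- intersect(rownames(x), rownames(x1))  # rows present in both
only_x  <- setdiff(rownames(x), rownames(x1))    # rows present only in x
only_x1 <- setdiff(rownames(x1), rownames(x))    # rows present only in x1

# Keep only the shared rows, in the same order, then bind the columns
combined <- cbind(x[shared, , drop = FALSE], x1[shared, , drop = FALSE])
print(combined)
```

Indexing both data frames by the same vector of row names guarantees that the rows line up before cbind; for this example the result should match what merge(x, x1, by = 0) returns, minus the extra Row.names column.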
