Populating a column with values from two of several columns based on value in another column - r

I have a data cleaning/transformation problem that I've solved in a way I'm 1,000% sure could be done much more simply.
Below is an example of what my data looks like initially. The first four columns are numbers I'll use for a lookup, the next is the type of the item, and the last two columns are the ones I want to fill. Based on the value in the type column, I would like to fill value_one and value_two with the values from the correspondingly numbered columns of the matching type: either one_apple and two_apple, or one_orange and two_orange. For example, if the type in a row is "apple", I would like to fill value_one with that row's one_apple value and value_two with that row's two_apple value.
one_apple one_orange two_apple two_orange type value_one value_two
1 23 56 90 orange NA NA
2 24 57 91 orange NA NA
3 25 58 92 apple NA NA
4 26 59 93 apple NA NA
5 27 60 94 orange NA NA
6 28 61 95 apple NA NA
...
This is what I would like that dataframe to look like after I run my code:
one_apple one_orange two_apple two_orange type value_one value_two
1 23 56 90 orange 23 90
2 24 57 91 orange 24 91
3 25 58 92 apple 3 58
4 26 59 93 apple 4 59
5 27 60 94 orange 27 94
6 28 61 95 apple 6 61
...
The way I've solved this right now is with a for loop, which figures out the indices of the columns matching the type value in that row, which(str_sub(names(example_data), start = 5) == example_data$type[i]). Then I use the first of those indices to select the correct value for the value_one column from the appropriate place, example_data[i, which(...)[1]], and assign it to value_one. I do the same thing, with the second index, for value_two.
Below I have code which first creates an example dataset like the one I want to transform, and then shows my for loop running on it to transform the data.
library(stringr)

example_data = data.frame(one_apple = 1:(1+30), one_orange = 23:(23+30),
                          two_apple = 56:(56+30), two_orange = 90:(90+30),
                          type = sample(c("apple","orange"), 31, replace = T),
                          value_one = rep(NA,31), value_two = rep(NA,31))

for(i in 1:nrow(example_data)){
  example_data$value_one[i] = example_data[i, which(str_sub(names(example_data), start = 5) == example_data$type[i])[1]]
  example_data$value_two[i] = example_data[i, which(str_sub(names(example_data), start = 5) == example_data$type[i])[2]]
}
This transformation works, but it is clearly not great code, and I feel like I'm missing an easier way to do it, perhaps with apply and without the convoluted use of which to grab column indexes. It would be very helpful to see a better way to do this.
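One simpler, fully vectorized alternative (just a sketch, assuming the only two types are "apple" and "orange" as in this example) is to pick between the matching columns with ifelse, which removes both the loop and the which() lookups:
example_data$value_one = ifelse(example_data$type == "apple",
                                example_data$one_apple, example_data$one_orange)
example_data$value_two = ifelse(example_data$type == "apple",
                                example_data$two_apple, example_data$two_orange)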

Related

How to extend a hash with multiple values in R

So I understand that in R, a hash() is similar to a dictionary. I would like to extract specific values from my data frame and put them into a hash.
The componentindex column is where I have my keys, and the cluster.index and UniqueFileSourcesCount columns contain my values. So for the same key I have multiple values, e.g. hash {91: [1,15], [22,99], ...}.
So I would like to create a hash that contains each key with multiple values, but I'm not sure how to do that.
library(hash)

mini_df <- head(df, 10)                   # using a small df
compID <- unique(mini_df$componentindex)  # vector of unique keys
h1 <- hash()
for (i in 1:nrow(mini_df)) {
  if (compID == mini_df[i, "componentindex"]) {  # note: compID is a vector here
    h1 <- hash(mini_df[i, "componentindex"],
               c(mini_df[i, "cluster.index"], mini_df[i, "UniqueFileSourcesCount"]))
  }
  # h2 <- append(h2, h1)
}
If I print h1, I end up having only the last value:
<hash> containing 1 key-value pair(s).
91 : 42 5
Which I understand, since I don't append to this hash but overwrite it. I'm not sure how to append to/expand hashes in R, and I have not been able to find a solution yet.
mini_df:
UniqueFileSourcesCount cluster.index componentindex
1 15 1 91
2 15 10 -1
3 99 22 91
4 63 23 1675
5 12 25 91
6 6 27 91
7 50 37 91
8 5 42 91
9 2 43 -1
10 2 69 -1
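A minimal sketch of one way to get the appending behaviour with the hash package (using the mini_df shown above): store a list under each key and grow it, instead of rebuilding h1 on every iteration.
library(hash)

h1 <- hash()
for (i in 1:nrow(mini_df)) {
  key <- as.character(mini_df[i, "componentindex"])
  val <- c(mini_df[i, "cluster.index"], mini_df[i, "UniqueFileSourcesCount"])
  if (has.key(key, h1)) {
    h1[[key]] <- c(h1[[key]], list(val))   # append to the existing list of value pairs
  } else {
    h1[[key]] <- list(val)                 # first pair of values for this key
  }
}
# h1[["91"]] is now a list of c(cluster.index, UniqueFileSourcesCount) pairs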

Summing values after every third position in data frame in R

I am new to R. I have a data frame like the following:
df = data.frame(Id = c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2",
                       "Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
                Start = c(20,20,20,37,37,37,68,10,10,10,10),
                End = c(50,50,50,78,78,78,200,94,94,94,94),
                Pos = c(14,34,21,50,18,70,101,35,2,56,67),
                Hits = c(12,34,17,89,45,87,1,5,6,3,26))
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data frame in 3 different modes. For example, for Entry_1: mode_1 = seq(20, 50, 3), mode_2 = seq(21, 50, 3), and mode_3 = seq(22, 50, 3). I would like to sum all the values in the Hits column whose corresponding values in the Pos column fall in mode_1, mode_2, or mode_3, and generate a data frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1 = 0
mode_2 = 0
mode_3 = 0
mode_1_sum = 0
mode_2_sum = 0
mode_3_sum = 0
for (i in dim(df)[1])
{
  if (df$Pos[i] %in% seq(df$Start[i], df$End[i], 3))
  {
    mode_1_sum = mode_1_sum + df$Hits[i]
    print(mode_1_sum)
  }
  mode_1 = mode_1_sum + counts
  print(mode_1)
  ifelse(df$Pos[i] %in% seq(df$Start[i] + 1, df$End[i], 3))
  {
    mode_2_sum = mode_2_sum + df$Hits[i]
    print(mode_2_sum)
  }
  mode_2_sum = mode_2_sum + counts
  print(mode_2)
  ifelse(df$Pos[i] %in% seq(df$Start[i] + 2, df$End[i], 3))
  {
    mode_3_sum = mode_3_sum + df$Hits[i]
    print(mode_3_sum)
  }
  mode_3_sum = mode_3_sum + counts
  print(mode_3_sum)
}
But the above code only prints 26. Can anyone guide me on how to generate my desired output, please? I can provide many more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start < df$Pos) * (df$End > df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the indices of df$Pos that are both bigger than df$Start and smaller than df$End; these comparisons return 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and then add 1 to get the right mode. We multiply these two things together, so that values that fall within the interval retain the right mode, and values that fall outside the interval become 0.
Next, we create an empty matrix that will contain the values. Then, we use a for-loop to fill in the matrix. Finally, we aggregate the matrix.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
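Another way to express the same computation, sketched here as a possible alternative, is to compute the per-row mode exactly as above and then let xtabs do the grouping and summing in one step (md is 0 for positions outside (Start, End), otherwise the mode number 1, 2 or 3):
md <- ((df$Pos - df$Start) %% 3 + 1) * (df$Start < df$Pos) * (df$End > df$Pos)
xtabs(Hits ~ Id + md, data = cbind(df, md = md), subset = md > 0)
#          md
# Id         1  2  3
#   Entry_1  0 17 34
#   Entry_2 87 89  0
#   Entry_3  1  0  0
#   Entry_4 26  8  0
The result is a contingency table rather than a data frame, but it contains the same per-mode sums.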

Apply over all columns and rows of two different dataframes in R

I'm trying to apply a function over all rows and columns of two data frames, but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only intended to be an example function to keep things simple.
# some data and a function
df1 <- data.frame(name = c("aa","bb","cc","dd","ee"),
                  a = sample(1:50, 5), b = sample(1:50, 5), c = sample(1:50, 5))
df2 <- data.frame(name = c("aa","bb","cc","dd","ee"),
                  a = sample(1:50, 5), b = sample(1:50, 5), c = sample(1:50, 5))
simplefunction <- function(a, b){ a + b }
# apply on a single row
simplefunction(df1[1,2], df2[1,2])
# apply over all columns
apply(?)
## apply over all columns and rows
# create df to receive results
df3 <- df2
# loop it
for (i in 2:5) df3[i] <- apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The data frames will be passed in order to the function, so in this example df1 would be x, df2 would be y, and z, etc. would be whatever other data frames you specify, in that order. Hopefully that makes sense. Here mapply takes the first column of every data frame and applies the function to them, then the second column of every data frame, and so on; because the function is vectorized, each call operates on whole columns at once, so every cell of the result gets filled.
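For instance, a small sketch with a hypothetical third data frame df_extra (built the same way as df1 and df2), summing the three corresponding cells:
df_extra <- data.frame(name = c("aa","bb","cc","dd","ee"),
                       a = sample(1:50, 5), b = sample(1:50, 5), c = sample(1:50, 5))
mapply(FUN = function(x, y, z) x + y + z, df1[,-1], df2[,-1], df_extra[,-1])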
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do:
df1[,-1] + df2[,-1]
Which gives:
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45

Applying function to multiple rows using values from multiple rows

I have created the following simple function in R:
fun <- function(a,b,c,d,e){b+(c-a)*((e-b)/(d-a))}
I want to apply this function to a data.frame that looks something like:
> data.frame("x1"=seq(55,75,5),"x2"=round(rnorm(5,50,10),0),"x3"=seq(30,10,-5))
x1 x2 x3
1 55 51 30
2 60 45 25
3 65 43 20
4 70 57 15
5 75 58 10
I want to apply fun to each row separately to create a new variable x4, but now comes the difficult part (to me at least): for the arguments d and e I want to use the values of x2 and x3 from the next row. So for the first row of the example that would mean: fun(a=55, b=51, c=30, d=45, e=25). I know that I can use mapply() to apply a function to each row, but I have no clue how to tell mapply that it should use some values from the next row, or whether I should be looking for a different approach than mapply().
Many thanks in advance!
Use mapply, but shift the columns supplied as the fourth and fifth arguments up by one row. You can do it manually, or use taRifx::shift.
> dat
x1 x2 x3
1 55 25 30
2 60 58 25
3 65 59 20
4 70 68 15
5 75 43 10
library(taRifx)
> shift(dat$x2)
[1] 58 59 68 43 25
> mapply( dat$x1, dat$x2, dat$x3, shift(dat$x2), shift(dat$x3) , FUN=fun )
[1] 25.00000 -1272.00000 719.00000 -50.14815 26.10000
If you want the last row to be NA rather than wrapping, use wrap=FALSE,pad=TRUE:
> shift(dat$x2,wrap=FALSE,pad=TRUE)
[1] 58 59 68 43 NA
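A sketch of the manual alternative in base R (no taRifx), assuming you want NA in the last row rather than wrapping:
shift_up <- function(x) c(x[-1], NA)   # drop the first element, pad the end with NA
dat$x4 <- mapply(fun, dat$x1, dat$x2, dat$x3, shift_up(dat$x2), shift_up(dat$x3))
The first four results match the wrapped version above; the last becomes NA because its d and e arguments are NA.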

How can I get column data to be added based on a group designation using R?

The data set I'm working with is similar to the one below (although the example is of a much smaller scale; the data I'm working with is tens of thousands of rows), and I haven't been able to figure out how to get R to add up column data based on the group number. Essentially I want to get the number of greens, blues, and reds added up for groups 81 and 66 separately, and then be able to use that information to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own, hoping I would stumble across a proper way to do it, but since I'm such a new user I feel like I've hit a wall that I cannot get past without help.
One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
The other answers here are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr:
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))
I would suggest that #Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
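As a small follow-up sketch for the percentages mentioned in the question (assuming the data is in dat as read above, and reusing the aggregate() result): divide each colour's group sum by the group total.
sums <- aggregate(. ~ Group, data = dat, FUN = sum)
pct <- cbind(sums["Group"], round(100 * sums[c("Green", "Blue", "Red")] / sums$Total, 1))
pct
#   Group Green Blue  Red
# 1    66  43.0 44.9 12.1
# 2    81  53.0 22.4 24.6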
