Go through an R dataframe and increase the salary under certain conditions

I have this sample of my dataframe (df):
age salary
1 25 20000
2 35 22000
3 31 23500
4 24 19200
5 27 27900
6 32 31010
I want to increase the salary by 11% for people who are older than 30 and whose salary is not the maximum salary in the table. I wrote this loop:
for(row in df){
if (row$age > 30 & row$salary != max(df$salary)){
row$salary = row$salary * 0.11
}
}
but I get salaries lower than the originals rather than an increase.
Would really appreciate any help.

Here's one way without explicit ifelse. Should be one of the fastest ways to do this -
df$new_salary <- with(df, salary + 0.11*salary*(age > 30)*(salary != max(salary)))

The loop fails for two reasons: for (row in df) iterates over the columns of the data frame rather than its rows, and multiplying by 0.11 leaves 11% of the salary instead of adding 11% to it. Try a vectorised assignment instead:
k <- max(df$salary)
df[df$age>30 & df$salary!=k, 'salary'] <- df[df$age>30 & df$salary!=k, 'salary'] * 1.11
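As a quick check of the stated rule (11% raise for age above 30, excluding the maximum salary), here is a minimal sketch on the sample data from the question. The helper name eligible is introduced only for illustration.

```r
# Sample data from the question
df <- data.frame(
  age    = c(25, 35, 31, 24, 27, 32),
  salary = c(20000, 22000, 23500, 19200, 27900, 31010)
)

# Logical mask: older than 30 AND not the top earner
eligible <- df$age > 30 & df$salary != max(df$salary)

# Multiply by 1.11 (not 0.11) to add 11%
df$salary[eligible] <- df$salary[eligible] * 1.11

df
```

Rows 2 and 3 are raised; row 6 is older than 30 but holds the maximum salary, so it is left unchanged.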

Related

R subset dataframe, skip part of interval

I'm trying to subset my full data (including all the other variables) to an interval of zipcodes, EXCLUDING a certain part of that interval. I'm quite new to R and can't get it to work. (Zipcode = postnr)
I have over 100 000 zipcodes (postnr) and want all values for individuals in zipcodes 10 000-12 999 and 15 600-16 800 in my dataset.
Attempt 1
Datan <- subset(Data2, Data2$postnr >= 10000 & Data2$postnr <= 16880)
Datant <- subset(Datan, Datan$postnr >= 15600 & Datan$postnr < 13000)
Datan returns 31 3000 obs in 26 variables and Datant returns 0 obs in 26 variables.
Attempt 2
attach(Data2)
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) & between(postnr, 15600, 16880))
Data5 returns 0 observations...
I have thousands of values for all my variables inside those intervals. What am I doing wrong?
Think about "and" versus "or" and you've got it. As it is, you're really close!
Can a number be between 1 and 2 and 3 and 5? Nope. But if I said, can a number be between 1 and 2 or 3 and 5? Yup.
Updated
For subset:
Datan <- subset(Data2, postnr >= 10000 & postnr < 13000 |
postnr >= 15600 & postnr <= 16800)
Where that vertical pipe: | means 'or'.
For dplyr:
(I assume it's dplyr with filter.) You don't need to attach the data, it will extract the variable names from Data2 if it's in the pipe (which it is).
Data5 <- Data2 %>% filter(between(postnr, 10000, 12999) |
between(postnr, 15600, 16880))
I have no data, so I cannot properly test this, but the following should work.
Note the or operator (|) to specify two different conditions.
library(data.table)
dt <- as.data.table(Data2)
dt[(postnr >= 10000 & postnr < 13000) | (postnr >= 15600 & postnr <= 16880), ]
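To see the difference between & and | concretely, here is a small base R sketch on a made-up vector of zipcodes. The values and the names in_first, in_second, both and either are hypothetical, chosen only to sit on the interval boundaries.

```r
# Hypothetical zipcodes around the interval boundaries
postnr <- c(9999, 10000, 12999, 13000, 15600, 16880, 17000)

in_first  <- postnr >= 10000 & postnr <= 12999
in_second <- postnr >= 15600 & postnr <= 16880

# "and": no value can lie in both intervals at once
both <- postnr[in_first & in_second]

# "or": values lying in either interval
either <- postnr[in_first | in_second]

both
either
```

The "and" version is empty for the same reason the original attempts returned 0 observations; the "or" version keeps the values inside either interval.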

Replace values in column with previous year's value based on another column

I want to create a for loop that will replace a value in a row with a value from the previous year (same month), based on if two columns are matching.
I have created the structure of a for loop, but I have not made progress in determining how to get the for loop to reference a value from a previous year.
Here is an example dataset:
fish <- c("A","A","B","B","C","C")
fish_wt<-c(2,3,4,5,5,7)
fish_count<-c(2,200,47,78,5,845)
date <- as.Date(c('2010-11-1','2009-11-1','2009-11-1','2008-11-1','2008-2-1','2007-2-1'))
data <- data.frame(fish,fish_wt,fish_count,date)
data$newcount<-0
Here is my for loop so far:
for (i in 1:nrow(data)) {
if (data$fish_wt[i] == data$fish_count[i]) {
data$newcount[i] <- 10
} else {
data$newcount[i] <- data$fish_count[i]
}
}
Right now, I am using the value of the row-1, which is fine for this small dataset, but does not work for a larger one where the two fish A rows will not be next to one another.
for (i in 1:nrow(data)) {
if (data$fish_wt[i] == data$fish_count[i]) {
data$newcount[i] <- data$newcount[data$date==data$date[i-1]]
} else {
data$newcount[i] <- data$fish_count[i]
}
}
This is what I want my dataset to look like:
fish fish_wt fish_count date newcount
1 A 2 2 2010-11-01 200
2 A 3 200 2009-11-01 200
3 B 4 47 2009-11-01 47
4 B 5 78 2008-11-01 78
5 C 5 5 2008-02-01 845
6 C 7 845 2007-02-01 845
I have thought of separating rows by fish, then using the row-1 solution. I am just wondering if there is something easier.
As a solution to this problem, I set up a table of mean temperature by fish, year, and month (long format), then merged the dataset and used the average value for any row where fish_wt==fish_count.
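For the example dataset as posted, a lookup by fish, year and month can also be done without a loop. The following is only a sketch of one alternative, not the merge-based approach described above; the names yr, mon, key, prev_key and idx are introduced for illustration, and it assumes at most one row per fish and month in each year.

```r
fish <- c("A", "A", "B", "B", "C", "C")
fish_wt <- c(2, 3, 4, 5, 5, 7)
fish_count <- c(2, 200, 47, 78, 5, 845)
date <- as.Date(c("2010-11-01", "2009-11-01", "2009-11-01",
                  "2008-11-01", "2008-02-01", "2007-02-01"))
data <- data.frame(fish, fish_wt, fish_count, date)

# Build a key for each row and for the same fish/month one year earlier
yr  <- as.integer(format(data$date, "%Y"))
mon <- format(data$date, "%m")
key      <- paste(data$fish, yr, mon)
prev_key <- paste(data$fish, yr - 1, mon)

# Row index of the previous year's record (NA if there is none)
idx <- match(prev_key, key)

# Use last year's count only where the two columns match and a record exists
use_prev <- data$fish_wt == data$fish_count & !is.na(idx)
data$newcount <- ifelse(use_prev, data$fish_count[idx], data$fish_count)

data
```

Because match() looks rows up by key, the result does not depend on the two records of a fish being adjacent.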

Using variables from two different size datasets (and logic relations) to create a new variable

I have two dataframes with very different numbers of observations. I would like to bring some information from one into the other, conditional on some logical relations, but I can't get it to work. A down-scaled example would look something like this:
year <- as.vector(c(rep(1949,5), rep(1950,5), rep(1951,5), rep(1952,5)))
moneyband <- as.vector(c(rep(c(10,20,30,40,50),4)))
rate <-as.vector(c(rep(c(0.1,0.2,0.3,0.4,0.5),2),rep(c(0.15,0.25,0.35,0.45,0.55),2)))
datasmall <- as.data.frame(cbind(year,moneyband,rate))
yearbig <- as.vector(c(rep(1949,10), rep(1950,10), rep(1951,10), rep(1952,11)))
earnings <- as.vector(c(rep(c(9,19,30,39,50),8),60))
databig <- as.data.frame(cbind(yearbig,earnings))
Now I want to create a new variable in the big database (let's call it ratebig) that assigns to that variable the rate associated with that amount of earnings, if earnings (in the big database) equal moneyband (in the small database) for a given year. As you can see, in this example this would happen with the values 30 and 50. The rest I would like them to be NA.
I tried this:
databig$ratebig <- NA
for (i in 1949:1952) {
databig$ratebig[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])] <- datasmall$rate[datasmall$year == i & (databig$earnings[databig$yearbig==i]==datasmall$moneyband[datasmall$year == i])]
}
But the different sizes of the datasets (or something else) are giving me trouble: I get errors and the results are wrong. The result does not seem to respect the conditions as I intended, and it is influenced by relative position and structure in the two datasets.
In principle, I wouldn't want to merge the datasets (we are talking about a high number of observations in the real data) and was hoping for a way to do this.
Thanks!!
For your case merge works fine
merge(databig, datasmall, by.x = c("yearbig", "earnings"),
by.y = c("year", "moneyband"), all.x = TRUE)
# yearbig earnings rate
#1 1949 9 NA
#2 1949 9 NA
#3 1949 19 NA
#4 1949 19 NA
#5 1949 30 0.30
#6 1949 30 0.30
#7 1949 39 NA
#8 1949 39 NA
#9 1949 50 0.50
#10 1949 50 0.50
#.....
As for why your for loop doesn't work as expected: you need to do the lookup for every row of databig.
databig$ratebig <- NA
for (i in 1:nrow(databig)) {
inds <- databig$yearbig[i] == datasmall$year &
databig$earnings[i] == datasmall$moneyband
if (any(inds))
databig$ratebig[i] <- datasmall$rate[inds]
}
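If neither a merge nor a row-by-row loop is attractive on a large dataset, a keyed lookup with match() is another option. This is a sketch on the example data, assuming each (year, moneyband) pair appears only once in datasmall; key_small and key_big are names introduced here.

```r
year <- c(rep(1949, 5), rep(1950, 5), rep(1951, 5), rep(1952, 5))
moneyband <- rep(c(10, 20, 30, 40, 50), 4)
rate <- c(rep(c(0.1, 0.2, 0.3, 0.4, 0.5), 2),
          rep(c(0.15, 0.25, 0.35, 0.45, 0.55), 2))
datasmall <- data.frame(year, moneyband, rate)

yearbig <- c(rep(1949, 10), rep(1950, 10), rep(1951, 10), rep(1952, 11))
earnings <- c(rep(c(9, 19, 30, 39, 50), 8), 60)
databig <- data.frame(yearbig, earnings)

# One text key per row: "year value"
key_small <- paste(datasmall$year, datasmall$moneyband)
key_big   <- paste(databig$yearbig, databig$earnings)

# match() returns NA where no key matches, so unmatched rows get NA
databig$ratebig <- datasmall$rate[match(key_big, key_small)]

head(databig, 10)
```

Unlike merge, this keeps databig in its original row order and adds only the one new column.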

Unable to store loop output in R

I am trying to store the loop output. However, my dataset is quite big and it crashes RStudio whenever I try to View it. I have tried different techniques, such as the functions in library(iterators) and library(foreach), but they do not do what I want. I am trying to take a row from my main table (Table A, 54000 rows) and then a row from another smaller table (Table B, 6 rows). I have also taken a look at "Storing loop output in a dataframe in R", but it doesn't really allow me to view my results.
The code takes the first row from Table A, iterates it 6 times through Table B, outputs the result of each iteration, and then moves to Table A's second row. As such, my final dataset should contain 324000 (54000*6) observations.
Below is the code that provides me with the correct number of observations (but I am unable to view it to see if the values are being calculated correctly) and a snippet of Table A and Table B.
output_ratios <- NULL
for (yr in seq(yrs)) {
if (is.na(yr) == 'TRUE') {
numerator=0
numerator1=0
numerator2=0
denominator=0
} else {
numerator=Table.B[Table.B$PERIOD == paste("PY_", yr, sep=""), c("1")]
denominator=Table.B[Table.B$PERIOD == paste("PY_", yr, sep=""), c("2")]
denom=Table.A[, "1"] + (abs(Table.A[, "1"])*denominator)
num=Table.A[, "2"] + (abs(Table.A[, "2"])*numerator)
new.data$1=num
new.data$2=denom
NI=num / denom
NI_ratios$NI=c(NI)
output_ratios <<- (rbind(output_ratios, NI))
}
}
TABLE B:
PERIOD 1 2 3 4 5
1 PY_1 0.21935312 -0.32989691 0.12587413 -0.28323699 -0.04605116
2 PY_2 0.21328526 0.42051282 -0.10559006 0.41330645 0.26585064
3 PY_3 -0.01338112 -0.03971119 -0.06641667 -0.08238231 -0.05323772
4 PY_4 0.11625091 0.01127819 0.07114166 0.08501516 0.55676498
5 PY_5 -0.01269256 -0.02379182 0.39115278 -0.03716100 0.63530682
6 PY_6 0.69041864 0.51034273 0.59290357 0.78571429 -0.48683736
TABLE A:
1 2 3 4
1 25 3657 2258
2 23 361361 250
3 24 35 000
4 25 362 502
5 25 1039 502
I would greatly appreciate any help.
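One common pattern for storing loop output is to collect each iteration in a pre-allocated list and bind the pieces once at the end, then inspect the result with head() or str() rather than View(). The sketch below uses small made-up stand-ins for Table A and Table B; the column names a1, a2, b1, b2, row_A and the values are hypothetical, and the formula only mirrors the shape of the one in the question.

```r
# Hypothetical stand-ins for Table.A and Table.B (values are made up)
Table.A <- data.frame(a1 = c(25, 23, 24), a2 = c(3657, 361361, 35))
Table.B <- data.frame(PERIOD = paste0("PY_", 1:2),
                      b1 = c(0.2, 0.1),
                      b2 = c(-0.3, 0.4))

# Pre-allocate one list slot per row of Table.B
results <- vector("list", nrow(Table.B))

for (k in seq_len(nrow(Table.B))) {
  # Vectorised over all rows of Table.A for this period
  num   <- Table.A$a2 + abs(Table.A$a2) * Table.B$b1[k]
  denom <- Table.A$a1 + abs(Table.A$a1) * Table.B$b2[k]
  results[[k]] <- data.frame(row_A  = seq_len(nrow(Table.A)),
                             PERIOD = Table.B$PERIOD[k],
                             NI     = num / denom)
}

# Bind once at the end instead of growing with rbind inside the loop
output_ratios <- do.call(rbind, results)

# Inspect a few rows instead of opening the whole table
head(output_ratios)
str(output_ratios)
```

The result has nrow(Table.A) * nrow(Table.B) rows; sorting with output_ratios[order(output_ratios$row_A), ] would group the periods under each Table A row, as described in the question.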

R : Create specific bin based on data range

I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance, I have a data range of 10 to 1010, and I wish to separate the data into 100 bins.
Since 1010 - 10 = 1000
And 1000 / 100(The number of bin specified) = 10
Therefore the ideal data would look like this
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex: there is not just one data range but several. Hopefully the example below clarifies this.
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along the lines of this kind of function, but it seems inefficient because:
1) I have to retype the function 100 times (because my number of bin is 100)
2) I can't find a way to repeat the function along my values - In other words my function can only deal with the data 10-1010 and not the next one 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
f1
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
After a few hours of trying, I managed to answer my own question, so I thought I would share it. I used the function "bins" from the package "binr" to get the required bins. Please find my attempt below; it's slightly different from the intended output, but for my purpose it is still okay.
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code. Please share if there is a better way of doing this. It would be nice if someone could comment on how to wrap this into a function.
Here's a way that may help with base R:
bin_it <- function(START, END, BINS = 100) {
range <- END-START
jump <- range/BINS
v1 <- c(START, seq(START+jump+1, END, jump))
v2 <- seq(START+jump-1, END, jump)+1
data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
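If shared bin edges are wanted, as in the output sketched in the question (10 20, 20 30, ...), another base R option is to generate the breakpoints with seq(length.out = ) and round them. This is a sketch under that assumption; make_bins and n_bins are names introduced here, not part of any package.

```r
# Some fixed values (from the question)
start <- c(10, 5000, 4857694)
end   <- c(1010, 6500, 4897909)

make_bins <- function(start, end, n_bins = 100) {
  # n_bins + 1 evenly spaced breakpoints, rounded to whole numbers
  breaks <- round(seq(start, end, length.out = n_bins + 1))
  # Each bin runs from one breakpoint to the next
  data.frame(start = head(breaks, -1), end = tail(breaks, -1))
}

# Apply to every start/end pair and stack the results
bins <- do.call(rbind, Map(make_bins, start, end))

head(bins, 3)
bins[c(101, 201), ]
```

Map() walks over the start and end vectors in parallel, so the same function handles every range without retyping it.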
