Got an error using ifelse inside mutate inside the for loop - r

I have a list of 244 data frames which looks like the following:
The name of the list is datas.
datas[[1]]
year sal
2000 10000
2000 15000
2005 10000
2005 9000
2005 12000
2010 15000
2010 12000
2010 20000
2013 25000
2013 15000
2015 20000
I would like to add a new column called fix.sal by multiplying sal by a factor that depends on the year: 2 for 2000, 1.8 for 2005, 1.5 for 2010, 1.2 for 2013, and 1 for 2015. The result should look like this:
Year sal fix.sal
2000 10000 20000
2000 15000 30000
2005 10000 18000
2005 9000 16200
2005 12000 21600
2010 15000 22500
2010 12000 18000
2010 20000 30000
2013 25000 30000
2013 15000 18000
2015 20000 20000
I managed to do this for a single data frame by using ifelse inside mutate from the dplyr package.
library(dplyr)
datas[[1]]<-mutate(datas[[1]], fix.sal=
ifelse(datas[[1]]$Year==2000,datas[[1]]$sal*2,
ifelse(datas[[1]]$Year==2005,datas[[1]]$sal*1.8,
ifelse(datas[[1]]$Year==2010,datas[[1]]$sal*1.5,
ifelse(datas[[1]]$Year==2013,datas[[1]]$sal*1.2,
datas[[1]]$sal*1)))))
But I have to apply this operation to all 244 data frames in the list datas.
So I tried to do it with a for loop like this:
for(i in 1:244){
datas[[i]]<-mutate(datas[[i]], fix.sal=
ifelse(datas[[i]]$Year==2000,datas[[i]]$sal*2,
ifelse(datas[[i]]$Year==2005,datas[[i]]$sal*1.8,
ifelse(datas[[i]]$Year==2010,datas[[i]]$sal*1.5,
ifelse(datas[[i]]$Year==2013,datas[[i]]$sal*1.2,
datas[[i]]$sal*1)))))
}
This produced an error:
Error: invalid subscript type 'integer'
How can I solve this...?
Any comments will be greatly appreciated! :)

Please don't force yourself to use ifelse for this. Instead, create a vector with your multipliers, then use the year to select from the vector. The vector will look something like this:
multiplier <-
c("2005" = 1.2
, "2006" = 1.05
, "2007" = 0.9)
Use whatever multiplier applies to each year in your data. Here is some sample data (every data frame is identical, but that doesn't matter):
datas <-
lapply(1:3, function(idx){
data.frame(
Year = 2005:2007
, sal = c(10, 20, 30)
)
})
Finally, we can then use lapply to loop through the list more efficiently. Each time through, it uses the Year to pick a value from the multipliers vector (note the use of as.character, otherwise it will pick, e.g., the 2005th entry, instead of the one named "2005").
lapply(datas, function(x){
mutate(x, fix.sal = sal*multiplier[as.character(Year)])
})
returns:
[[1]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[2]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
[[3]]
Year sal fix.sal
1 2005 10 12
2 2006 20 21
3 2007 30 27
For more compact code, you can use:
lapply(datas, mutate, fix.sal = sal*multiplier[as.character(Year)])
but that makes it slightly less clear to me what is happening.
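As a minimal sketch, the same lookup can be applied to the multipliers from the question (2, 1.8, 1.5, 1.2, 1) using base R only. The two-element list here is a stand-in for the 244 data frames, and the column is assumed to be named Year, as in the question's code:

```r
# Multipliers from the question, named by year
multiplier <- c("2000" = 2, "2005" = 1.8, "2010" = 1.5, "2013" = 1.2, "2015" = 1)

# Stand-in data: the question's example frame, repeated twice
df <- data.frame(
  Year = c(2000, 2000, 2005, 2005, 2005, 2010, 2010, 2010, 2013, 2013, 2015),
  sal  = c(10000, 15000, 10000, 9000, 12000, 15000, 12000, 20000, 25000, 15000, 20000)
)
datas <- list(df, df)

# Look up each row's multiplier by name and add the new column
datas <- lapply(datas, function(x) {
  x$fix.sal <- x$sal * unname(multiplier[as.character(x$Year)])
  x
})
```

A year that is missing from multiplier yields NA in fix.sal, which makes gaps in the lookup easy to spot.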

Here's a simple solution using ifelse and lapply:
# Creating the list
df <- data.frame(year=c(rep(2000,2),rep(2005,3),rep(2010,3),rep(2013,2),2015),
sal=c(10000,15000,10000,9000,12000,15000,12000,20000,25000,15000,20000))
datas <- list(df,df)
# Applying the function with ifelse
lapply(datas,function(x){
# Use x (the current list element), not the global df
outp <- ifelse(x$year==2000,x$sal*2,
ifelse(x$year==2005,x$sal*1.8,
ifelse(x$year==2010,x$sal*1.5,
ifelse(x$year==2013,x$sal*1.2,x$sal*1))))
return(outp)
})
You'll get the result for each df inside the list.

Related

Rearranging data columns in R

I have an excel file that contains two columns : Car_Model_Year and Cost.
Car_Model_Year Cost
2018 25000
2010 9000
2005 13000
2002 35000
1995 8000
I want to sort my data as follows:
Car_Model_Year Cost
1995 8000
2002 35000
2005 13000
2010 9000
2018 25000
That is, the rows should be sorted by Car_Model_Year in ascending order. I wrote the following R code, but I don't know how to rearrange the values of the variable Cost accordingly.
my_data <- read.csv2("data.csv")
my_data <- sort(my_data$Car_Model_Year, decreasing = FALSE)
Any help will be very appreciated!
Are you looking for this?
sorted_df <- df[order(df$Car_Model_Year, df$Cost),]
print(sorted_df)
# A tibble: 5 x 2
Car_Model_Year Cost
<dbl> <dbl>
1 1995 8000
2 2002 35000
3 2005 13000
4 2010 9000
5 2018 25000
Note that you can put a minus sign in front of a numeric column to sort it in descending order:
# Sort by Car_Model_Year (descending) and Cost (ascending)
sorted_df <- df[order(-df$Car_Model_Year, df$Cost),]
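The minus sign only works for numeric columns. A sketch of one alternative for mixed directions, assuming a hypothetical character column Make that is not in the original data: order() accepts a vector for decreasing when method = "radix" is used.

```r
# Hypothetical data: Make is an invented column for illustration
cars <- data.frame(
  Make = c("Toyota", "Honda", "Toyota", "Honda"),
  Car_Model_Year = c(2010, 2018, 2018, 2010),
  Cost = c(9000, 25000, 27000, 8000)
)

# Make descending, Car_Model_Year ascending
idx <- order(cars$Make, cars$Car_Model_Year,
             decreasing = c(TRUE, FALSE), method = "radix")
sorted_cars <- cars[idx, ]
```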
Does the below approach work? To sort by two or more columns, you just add them to the order() - i.e. order(var1, var2,...)
my_data <- data.frame(Car_Model_Year=c(2018,2010,2005,2002,1995),
Cost=c(25000,9000,13000,35000,8000))
sorted <- my_data[order(my_data$Car_Model_Year, my_data$Cost),]
> print(sorted)
Car_Model_Year Cost
5 1995 8000
4 2002 35000
3 2005 13000
2 2010 9000
1 2018 25000
dplyr::arrange() makes it easy:
library(dplyr)
my_data %>% arrange(Car_Model_Year, Cost)
To sort Cost in descending order instead:
my_data %>% arrange(Car_Model_Year, desc(Cost))

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frames. For each year in data frame 1, I want to look up the value for the same year in data frame 2 and multiply a column of data frame 1 by it. For example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which contains the inflation conversion rates (code from @akrun):
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data<-inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
For example, the first row of data frame 1 has the year 2001, so I look up the conversion rate for 2001 in data frame 2, which is 1.201035, and multiply the price in data frame 1 by that rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else statements?
We can do a join on the 'year' with 'ref_year' and create the new column by assigning (:=) the output of product of 'price' and 'final_inf'
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df
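For comparison, a minimal base R sketch of the same lookup, assuming no packages are available: match() finds, for each year in df, the row of inf_data with the same ref_year, and the matching final_inf is multiplied into price.

```r
year <- c(2001, 2003, 2001, 2004, 2006, 2007, 2008, 2008, 2001, 2009, 2001)
price <- rep(1000, 11)
df <- data.frame(year, price)

ref_inf <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf)
inf_data$final_inf <- cumprod(1 + inf_data$ref_inf / 100)

# Row of inf_data that corresponds to each row of df (NA if the year is absent)
pos <- match(df$year, inf_data$ref_year)
df$after_conv <- df$price * inf_data$final_inf[pos]
```

Row order in df is preserved, and a year with no entry in inf_data gives NA rather than dropping the row.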

how can I split a dataframe by two columns and count number of rows based on group more efficient

I have a data.frame with more than 120000 rows, it looks like this
> head(mydf)
ID MONTH.YEAR VALUE
1 110 JAN. 2012 1000
2 111 JAN. 2012 1000
3 121 FEB. 2012 3000
4 131 FEB. 2012 3000
5 141 MAR. 2012 5000
6 142 MAR. 2012 4000
I want to split the data.frame by the MONTH.YEAR and VALUE columns and count the rows in each group. My expected result looks like this:
MONTH.YEAR VALUE count
JAN. 2012 1000 2
FEB. 2012 3000 2
MAR. 2012 5000 1
MAR. 2012 4000 1
I tried to split it and use the sapply count the number of each group, and this is my code
sp <- split(mydf, list(mydf$MONTH.YEAR, mydf$VALUE), drop=TRUE);
result <- data.frame(yearandvalue = names(sapply(sp, nrow)), count = sapply(sp, nrow))
but I find the process is very slow. Is there a more efficient way to implement this? Thank you very much.
Try
aggregate(ID~., mydf, length)
Or
library(dplyr)
mydf %>%
group_by(MONTH.YEAR, VALUE) %>%
summarise(count=n())
Or
library(data.table)
setDT(mydf)[, list(count=.N) , list(MONTH.YEAR, VALUE)]
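A self-contained sketch of the aggregate() variant, rebuilt from the six rows shown in the question. The grouping columns are spelled out instead of using the dot, and renaming the ID column to count is an added step that the answer above does not perform:

```r
mydf <- data.frame(
  ID = c(110, 111, 121, 131, 141, 142),
  MONTH.YEAR = c("JAN. 2012", "JAN. 2012", "FEB. 2012",
                 "FEB. 2012", "MAR. 2012", "MAR. 2012"),
  VALUE = c(1000, 1000, 3000, 3000, 5000, 4000)
)

# Count the IDs within each MONTH.YEAR / VALUE combination
counts <- aggregate(ID ~ MONTH.YEAR + VALUE, data = mydf, FUN = length)
names(counts)[names(counts) == "ID"] <- "count"
```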

Grouping and Std. Dev in R

I have a data frame called dt. dt looks like this.
Year Sale
2009 6
2008 3
2007 4
2006 5
2005 12
2004 3
I am interested in getting the standard deviation of sales over the past four years. Where four years of data are not available, as for 2006, 2005, and 2004, I want NA. How can I create a new column with the value corresponding to each year? The new data would look like this:
Year Sale std.
2009 6 std(05,06,07,08)
2008 3 std(07,06,05,04)
2007 4 NA
2006 5 NA
2005 12 NA
2004 3 NA
I tried this a lot, but because I am a novice at R, I couldn't do it. Someone please help. Thanks.
Edit :
Here is the data with GVKEY.
GVKEY FYEAR IBC
1 1004 2003 3.504
2 1004 2004 18.572
3 1004 2005 35.163
4 1004 2006 59.447
5 1004 2007 75.745
Regards
Edit:
I am using the suggested rollapply function like this:
dt <- ddply(dt, .(GVKEY), function(x){x$ww <- rollapply(x$Sale,4,sd, fill =NA, align="right"); x});
But I am getting the following error:
Error in seq.default(start.at, NROW(data), by = by) : wrong sign in 'by' argument
Not sure what I am doing wrong. The data with GVKEY is mentioned at the top.
You can use rollapply from package zoo:
require(zoo)
rollapply(df$Sale, 4, sd, fill=NA, align="right")
[edit] I used your data frame as sorted by year. If you have it in original order, you will probably need to use align="left"
This is how I solved the problem:
library(sqldf) # provides sqldf()
library(TTR) # provides runSD()
dt <- dt[order(dt$GVKEY,dt$FYEAR),];
dt <- sqldf("select GVKEY, FYEAR, IBC from dt");
dt$STDEARN <- ave(dt$IBC, dt$GVKEY,FUN = function(x) {if(length(x)>3) c(NA,head(runSD(x,4),-1)) else sample(NA,length(x),TRUE)});
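A package-free sketch of the same idea on the original Year/Sale data, assuming the rows are sorted by year in ascending order. The helper prev_sd is a hypothetical name; it returns the standard deviation of the k values before each position and NA where fewer than k earlier values exist.

```r
dt <- data.frame(Year = 2004:2009, Sale = c(3, 12, 5, 4, 3, 6))

# sd of the k observations preceding each element (current one excluded)
prev_sd <- function(x, k = 4) {
  sapply(seq_along(x), function(i) {
    if (i > k) sd(x[(i - k):(i - 1)]) else NA_real_
  })
}

dt$std <- prev_sd(dt$Sale)
```

For grouped data such as the GVKEY example, the helper could be passed to ave() in the same way runSD is used above.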

Refer to relative rows in R

I know this answer must be out there, but I can't figure out how to word the question.
I'd like to calculate the differences between values in my data.frame.
from this:
f <- data.frame(year=c(2004, 2005, 2006, 2007), value=c(8565, 8745, 8985, 8412))
year value
1 2004 8565
2 2005 8745
3 2006 8985
4 2007 8412
to this:
year value diff
1 2004 8565 NA
2 2005 8745 180
3 2006 8985 240
4 2007 8412 -573
(ie value of current year minus value of previous year)
But I don't know how to have a result in one row that is created from another row. Any help?
Thanks,
Tom
There are many different ways to do this, but here's one:
f[, "diff"] <- c(NA, diff(f$value))
More generally, if you want to refer to relative rows, you can use dplyr::lag() (base R's lag() only shifts the time base of time-series objects) or do it directly with indexes:
f[-1,"diff"] <- f[-1, "value"] - f[-nrow(f), "value"]
Use the diff function, naming the new column so that it does not get an auto-generated name:
f <- cbind(f, diff = c(NA, diff(f[,2])))
If year column isn't sorted then you could use match:
f$diff <- f$value - f$value[match(f$year-1, f$year)]
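To illustrate the match() approach, here is a small sketch with the question's rows deliberately shuffled. Each row finds the row holding the previous year, wherever it sits, and years with no predecessor get NA.

```r
# Same values as in the question, in a shuffled row order
f <- data.frame(year  = c(2006, 2004, 2007, 2005),
                value = c(8985, 8565, 8412, 8745))

# Position of the previous year's row for each row (NA if there is none)
prev <- match(f$year - 1, f$year)
f$diff <- f$value - f$value[prev]
```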
