Records:
UniqueID Country Price
AAPL USA 107
AAPL USA 105
GOOG USA 555
GOOG USA 555
VW DEU 320
Mapping:
UniqueID Country Price
AAPL USA 120
GOOG USA 550
VW DEU 300
I want to add a column Final to the Records table, mapping values from the Mapping table by UniqueID. For example, all the AAPL entries in the Records table should have a Final value of 120.
Output:
Records:
UniqueID Country Price Final
AAPL USA 107 120
AAPL USA 105 120
GOOG USA 555 550
GOOG USA 555 550
VW DEU 320 300
I used the following line of code:
Records$Final <- Mapping[which(Records$UniqueID==Mapping$UniqueID),"Price"]
It throws an error saying the replacement and data lengths differ. Using merge instead duplicates the columns, which I don't want.
We can use inner_join:
library(dplyr)
inner_join(Records, Mapping, by = c('UniqueID', 'Country'))
# UniqueID Country Price.x Price.y
#1 AAPL USA 107 120
#2 AAPL USA 105 120
#3 GOOG USA 555 550
#4 GOOG USA 555 550
#5 VW DEU 320 300
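If you also want the joined column to be called Final rather than Price.y, one option (a sketch, assuming the data frames shown in the question) is to rename Mapping's Price column before the join:

```r
library(dplyr)

Records <- data.frame(UniqueID = c("AAPL", "AAPL", "GOOG", "GOOG", "VW"),
                      Country = c("USA", "USA", "USA", "USA", "DEU"),
                      Price = c(107, 105, 555, 555, 320))
Mapping <- data.frame(UniqueID = c("AAPL", "GOOG", "VW"),
                      Country = c("USA", "USA", "DEU"),
                      Price = c(120, 550, 300))

# Rename Mapping's Price to Final before joining, so the new
# column arrives with the desired name instead of Price.y
inner_join(Records, rename(Mapping, Final = Price),
           by = c("UniqueID", "Country"))
```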
To follow your original approach, use match instead of which:
Records$Final <- Mapping$Price[match(Records$UniqueID, Mapping$UniqueID)]
Records
# UniqueID Country Price Final
#1 AAPL USA 107 120
#2 AAPL USA 105 120
#3 GOOG USA 555 550
#4 GOOG USA 555 550
#5 VW DEU 320 300
First, rename the Price column in the Mapping table to Final:
colnames(Mapping)[colnames(Mapping) == "Price"] <- "Final"
Then use merge(); you should get exactly what you wanted:
Records <- data.frame(UniqueID = c("AAPL", "AAPL", "GOOG", "GOOG", "VW"),
                      Country = c("USA", "USA", "USA", "USA", "DEU"),
                      Price = c(107, 105, 555, 555, 320))
Mapping <- data.frame(UniqueID = c("AAPL", "GOOG", "VW"),
                      Country = c("USA", "USA", "DEU"),
                      Price = c(120, 550, 300))
names(Mapping)[3] <- "Final"
Output <- merge(x=Records,y=Mapping[,c(1,3)],by="UniqueID",all.x=TRUE)
I have three datasets:
one containing a bunch of information about storms,
one containing the full names of the states and their abbreviations,
and one containing the year and population for each state.
What I want to do is add a column population to the first data frame, storms, containing the population per year for each state, using the other two data frames state_codes and states.
Can anyone point me in the right direction? Below is some sample data:
> head(storms)
num yr mo dy time state magnitude injuries fatalities crop_loss
1 1 1950 1 3 11:00:00 MO 3 3 0 0
2 1 1950 1 3 11:10:00 IL 3 0 0 0
3 2 1950 1 3 11:55:00 IL 3 3 0 0
4 3 1950 1 3 16:00:00 OH 1 1 0 0
5 4 1950 1 13 05:25:00 AR 3 1 1 0
6 5 1950 1 25 19:30:00 MO 2 5 0 0
> head(state_codes)
Name Abbreviation
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
6 Colorado CO
> head(states)
Year Alabama Arizona Arkansas California Colorado Connecticut Delaware
1 1900 1830 124 1314 1490 543 910 185
2 1901 1907 131 1341 1550 581 931 187
3 1902 1935 138 1360 1623 621 952 188
4 1903 1957 144 1384 1702 652 972 190
5 1904 1978 151 1419 1792 659 987 192
6 1905 2012 158 1447 1893 680 1010 194
You didn't provide much data to test with, but this should do it.
First, join storms to state_codes, so that it will have state names that are in states. We can rename yr to match states$Year at the same time.
Then pivot states to be in long form.
Finally, join the new version of storms to the long version of states.
library(dplyr)
library(tidyr)
storms %>%
left_join(state_codes,by = c("state" = "Abbreviation")) %>%
rename(Year = yr) -> storms.with.names
states %>%
pivot_longer(-Year, names_to = "Name",
values_to = "Population") -> long.states
storms.with.names %>%
left_join(long.states) -> result
This answer doesn't use dplyr, but I'm offering it because I know that it's very fast on large datasets.
It follows the same first step as the accepted answer: merge state names into the storms dataset. But then it does something clever (I stole the idea): it creates a matrix of row and column numbers, and then uses that to extract the elements from the "states" dataset that you want for the new column.
#Add the state names to storms
storms <- merge(storms, state_codes, by.x = 6, by.y = 2, all.x = TRUE)
#Get row and column indexes for the elements in 'states'
r <- match(storms$yr, states$Year)
c <- match(storms$Name, names(states)) #"Name" is the state-name column brought in by the merge
smat<-cbind(r,c)
#And grab them into a new vector
storms$population<-states[smat]
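To make the matrix-indexing trick concrete, here is a minimal standalone illustration on a toy matrix (nothing here depends on the storms data):

```r
# A two-column matrix of (row, column) pairs extracts one element per pair
m <- matrix(1:9, nrow = 3)              # column-major: m[1,2] is 4, m[3,2] is 6
idx <- cbind(r = c(1, 3), c = c(2, 2))  # want m[1,2] and m[3,2]
m[idx]
# [1] 4 6
```

This is a single vectorized lookup, which is why it stays fast on large datasets.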
I have a large data set that I have imported from Excel to R. I want to get all the entries that have a negative value for a specific variable, MG. I use the code:
A <- subset(df, MG < 0)
However, A becomes empty, despite the fact that there are several entries with values below 0. This is not the case when I look for values larger than 0 (MG > 0). It should be added that there are NA values in the data, but adding na.rm = TRUE does not help.
I also notice that R treats MG as a binary true/false variable since it sometimes contains 1 and 0.
Any idea what I have done wrong?
edit:
Country Region Code Product name Year Value MG
Sweden Stockholm 123 Apple 1991 244 NA
Sweden Kirruna 123 Apple 1987 100 NA
Japan Kyoto 543 Pie 1987 544 NA
Denmark Copenhagen 123 Apple 1998 787 0
Denmark Copenhagen 123 Apple 1987 100 1
Denmark Copenhagen 543 Pie 1991 320 0
Denmark Copenhagen 126 Candy 1999 200 1
Sweden Gothenburg 126 Candy 2013 300 0
Sweden Gothenburg 157 Tomato 1987 150 -55
Sweden Stockholm 125 Juice 1987 250 150
Sweden Kirruna 187 Banana 1998 310 250
Japan Kyoto 198 Ham 1987 157 1000
Japan Kyoto 125 Juice 1987 550 -1
Japan Tokyo 125 Juice 1991 100 0
From your comments it looks like you're using read_excel to read in the data. It only inspects the first rows to guess each column's type. You can bypass the guessing so that it knows from the start that MG is numeric:
df <- read_excel("Test/df.xlsx",
col_types = c("text", "text", "numeric", "text", "numeric", "numeric", "numeric"))
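If re-reading the file is inconvenient, an alternative sketch is to coerce the column after import (the small data frame below is a hypothetical stand-in for the imported data):

```r
# Hypothetical stand-in for the imported MG column
df <- data.frame(MG = factor(c(NA, "0", "1", "-55", "-1")))

# Coerce via character first: converting a factor straight to numeric
# would return its internal codes, not the values
df$MG <- as.numeric(as.character(df$MG))
A <- subset(df, MG < 0)  # rows where MG is NA are dropped automatically
```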
Hey everyone, I have a dataset with about 8 variables, and I want to find the largest volume for each combination of city and year.
The dataset looks like this:
city sales volume year avg price
abilene 239 12313 2000 7879
kansas 2324 18765 2000 2424
nyc 2342 987651 2000 3127
abilene 3432 34342 2001 1234
nyc 2342 10000 2001 3127
kansas 176 3130 2001 879
kansas 123 999650 2002 2424
abilene 3432 34342 2002 1234
nyc 2342 98000 2002 3127
I want my dataset to look like this :
city year volume
nyc 2000 987651
abilene 2001 34342
kansas 2002 999650
I used ddply (from the plyr package) to find the maximum volume for each city:
newdf <- ddply(df, c('city','year'), summarise, max(volume))
However, this gives me the maximum value of each city for each year, whereas I just want the maximum volume across all cities for each year. Thank you.
library(dplyr)
df %>% #df is your dataframe
group_by(year)%>%
filter(volume==max(volume))
Source: local data frame [3 x 5]
Groups: year
city sales volume year avg_price
1 nyc 2342 987651 2000 3127
2 abilene 3432 34342 2001 1234
3 kansas 123 999650 2002 2424
#updated : If you are grouping by both city and year
df %>% #df is your dataframe
group_by(year,city)%>%
filter(volume==max(volume))
Source: local data frame [9 x 5]
Groups: year, city
city sales volume year avg_price
1 abilene 239 12313 2000 7879
2 kansas 2324 18765 2000 2424
3 nyc 2342 987651 2000 3127
4 abilene 3432 34342 2001 1234
5 nyc 2342 10000 2001 3127
6 kansas 176 3130 2001 879
7 kansas 123 999650 2002 2424
8 abilene 3432 34342 2002 1234
9 nyc 2342 98000 2002 3127
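In newer dplyr (>= 1.0.0), slice_max is a more direct way to express the same per-year maximum; a sketch using the question's data:

```r
library(dplyr)

df <- data.frame(city = c("abilene", "kansas", "nyc", "abilene", "nyc",
                          "kansas", "kansas", "abilene", "nyc"),
                 volume = c(12313, 18765, 987651, 34342, 10000,
                            3130, 999650, 34342, 98000),
                 year = c(2000, 2000, 2000, 2001, 2001,
                          2001, 2002, 2002, 2002))

# slice_max keeps the top row(s) per group; with_ties = FALSE
# returns exactly one row per year even if volumes tie
df %>%
  group_by(year) %>%
  slice_max(volume, n = 1, with_ties = FALSE) %>%
  select(city, year, volume)
```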
I am new to R. I have two data frames:
PriceData
Date AAPL MSFT GOOG
12/3/2014 100 45 522
12/2/2014 99 45 517
12/1/2014 97 45 511
11/28/2014 97 44 508
QuantityData
Symbol Position
MSFT 1000
AAPL 1200
GOOG 1300
Now I want to calculate the market value, so the output should look like this:
Date AAPL MSFT GOOG
12/3/2014 120000 45000 678600
12/2/2014 118800 45000 672100
12/1/2014 116400 45000 664300
11/28/2014 116400 44000 660400
You can try
indx <- match(colnames(PriceData)[-1], QuantityData$Symbol)
PriceData[,-1][,indx] <- PriceData[,-1][,indx]*
QuantityData[,2][col(PriceData[,-1])]
PriceData
# Date AAPL MSFT GOOG
#1 12/3/2014 120000 45000 678600
#2 12/2/2014 118800 45000 672100
#3 12/1/2014 116400 45000 664300
#4 11/28/2014 116400 44000 660400
Or
PriceData[,-1][,indx] <- t(t(PriceData[,-1][,indx])*QuantityData[,2])
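Another base-R sketch is sweep(), which lines up each position with its price column and multiplies column-wise (data frames reconstructed from the question):

```r
PriceData <- data.frame(Date = c("12/3/2014", "12/2/2014",
                                 "12/1/2014", "11/28/2014"),
                        AAPL = c(100, 99, 97, 97),
                        MSFT = c(45, 45, 45, 44),
                        GOOG = c(522, 517, 511, 508))
QuantityData <- data.frame(Symbol = c("MSFT", "AAPL", "GOOG"),
                           Position = c(1000, 1200, 1300))

# Reorder the positions to match the order of the price columns,
# then multiply each column by its position
qty <- QuantityData$Position[match(colnames(PriceData)[-1],
                                   QuantityData$Symbol)]
PriceData[, -1] <- sweep(PriceData[, -1], 2, qty, `*`)
```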
I've had this problem before, but I didn't write down the solution, so now I'm in trouble again!
I have a dataframe like the following:
Date Product Qty Income
201001 0001 1000 2000
201002 0001 1500 3000
201003 0001 1200 2400
.
.
201001 0002 3500 2000
201002 0002 3200 1900
201003 0002 3100 1850
In words, I have one line for each combination of Date/Product, and the information of Quantity and Income for each combination.
I want to rearrange this dataframe so it looks like the following:
Date Qty.0001 Income.0001 Qty.0002 Income.0002
201001 1000 2000 3500 2000
201002 1500 3000 3200 1900
201003 1200 2400 3100 1850
In words, I want to have one line for each date, and one column for each combination of Product/Information(Qty, Income).
How can I achieve this? Thanks in advance!
Use base R's reshape (x is your data frame):
reshape(x, idvar = "Date", timevar = "Product", direction = "wide")
Date Qty.0001 Income.0001 Qty.0002 Income.0002
1 201001 1000 2000 3500 2000
2 201002 1500 3000 3200 1900
3 201003 1200 2400 3100 1850
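The same reshape can also be written with tidyr's pivot_wider; a sketch (data frame reconstructed from the question; names_glue requires tidyr >= 1.0.0):

```r
library(tidyr)

x <- data.frame(Date = rep(c(201001, 201002, 201003), 2),
                Product = rep(c("0001", "0002"), each = 3),
                Qty = c(1000, 1500, 1200, 3500, 3200, 3100),
                Income = c(2000, 3000, 2400, 2000, 1900, 1850))

# names_glue builds "Qty.0001"-style column names to match
# the desired layout
pivot_wider(x, id_cols = Date, names_from = Product,
            values_from = c(Qty, Income),
            names_glue = "{.value}.{Product}")
```

Note that pivot_wider groups the new columns by value (all Qty columns, then all Income columns), so the column order may differ from reshape's output.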