My question is similar to this one: How to merge two datasets based on similar but not exact time variable written in string, using R? But my problem is somewhat simpler, so I hope for a simpler solution.
My dataframes look similar to this:
a <- data.frame(ID = 1:4,
                EG = c("CA", "EV", "EV", "TR"),
                year = c(2000, 2005, 2010, 2020),
                test = sample(4))
b <- data.frame(ID = 1:4,
                EG = c("CA", "EV", "EV", "TR"),
                test = sample(20),
                year = sample(2000:2019, 20, replace = TRUE))
Now I would like to perform a left join like merge(b, a, by = c("ID", "EG", "year"), all.x = TRUE). But I want this: if a year in b is not found in a, then that row of b should be matched to the row of a whose year is closest (in case of a tie, round down). In the end, every "ID", "EG", "year" combination should have a test value from the closest year in a.
In dplyr 1.1.0 or later, we can use join_by() with closest():
library(dplyr)
left_join(b, a, by = join_by(ID, EG, closest(year <= year)))
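For clarity, the same join can be written with explicit table prefixes; inside join_by(), x refers to the left table (b here) and y to the right table (a). This is just a restatement of the line above:
# x = b (left table), y = a (right table): match the closest a-year >= b-year
left_join(b, a, by = join_by(ID, EG, closest(x$year <= y$year)))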
I have a dataframe with counts of, for instance, males and females for certain groups, arranged in this way:
df <- data.frame(Round = c("R1", "R1", "R2", "R2"),
                 N. = c(20, 10, 15, 15),
                 Gender = c("M", "F", "M", "F"))
How can I create a table accounting for counts over, for instance, Round and Gender? I would like to show the distribution of gender for each round.
I have tried
table(df$Gender, df$Round)
but this is not what I need. I need instead to show N. by groups.
Something like this?
library(tidyr)
pivot_wider(df, names_from = Round, values_from = N.)
  Gender R1 R2
1      M 20 15
2      F 10 15
Or in base R with reshape:
reshape(df, direction = "wide", idvar = "Gender", timevar = "Round")
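Another base R option, if a contingency-style table is all you need, is xtabs(); with a variable on the left-hand side of the formula, it sums that variable over the cross-classification:
# Sums N. for each Gender/Round combination
xtabs(N. ~ Gender + Round, data = df)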
I have a dataframe (df) that contains Start times and End times for observations of different IDs:
df <- structure(list(ID = 1:4,
                     Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00",
                               "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
                     End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00",
                             "2021-05-13 8:19:00", "2021-05-13 8:12:00")),
                class = "data.frame", row.names = c(NA, -4L))
I want to create a new dataframe that shows the latest Start time and the earliest End time for each possible pairwise comparison between the levels of ID.
I was able to accomplish this by making a duplicate column of ID called ID2, using tidyr::expand to expand them, and saving that in an object called Pairs:
library(dplyr)
library(tidyr)  # expand() comes from tidyr
df$ID2 <- df$ID
Pairs <-
  df %>%
  expand(ID, ID2)
I then make two new objects, a and b, that store the Start and End times for each comparison separately, and combine them into df2:
a <- left_join(df, Pairs, by = 'ID') %>%
  rename(StartID1 = Start, EndID1 = End, ID2 = ID2.y) %>%
  select(-ID2.x)
b <- left_join(Pairs, df, by = "ID2") %>%
  rename(StartID2 = Start, EndID2 = End) %>%
  select(ID2, StartID2, EndID2)
df2 <- cbind(a, b)
df2 <- df2[, -4]
and finally using dplyr::if_else to find the LatestStart time and the EarliestEnd time for each of the comparisons:
df2 <-
  df2 %>%
  mutate(LatestStart = if_else(StartID1 > StartID2, StartID1, StartID2),
         EarliestEnd = if_else(EndID1 > EndID2, EndID2, EndID1))
This seems like such a simple task to perform; is there a more concise way to achieve this from df without creating all of these extra objects?
For such computations, outer() usually comes in handy:
df %>%
  mutate(across(c("Start", "End"), lubridate::ymd_hms)) %>%
  {
    data.frame(
      ID1 = rep(.$ID, each = nrow(.)),
      ID2 = rep(.$ID, nrow(.)),
      # c() flattens the matrix that outer() returns back into a vector
      LatestStart = c(outer(.$Start, .$Start, pmax)),
      EarliestEnd = c(outer(.$End, .$End, pmin))
    )
  }
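If self-comparisons and mirrored pairs are not wanted, the result can be filtered afterwards; a small sketch, assuming the pipeline above was assigned to an object called res:
# Keep each unordered pair once, dropping IDs paired with themselves
res <- subset(res, ID1 < ID2)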
I have a dataset that has a little over 1.32 million observations. I am trying to add a "growth.factor" column to the dataset, whose value is looked up by county and classification from another dataset called "cat.growth", which is 44 x 8.
I need to run the following code 352 times, changing the county and classification names each time, to get my desired result (44 counties, 8 different classifications):
parcel.data.1$growth.factor <- ifelse(
  parcel.data.1$classification == "Ag" & parcel.data.1$county == "Ada",
  1 + cat.growth["Ada", "Ag"],
  parcel.data.1$growth.factor)
If I do so, it takes approximately 16.7 seconds to run, but it takes up 352 lines of code. I can achieve the same thing in 4 lines of code with this for loop:
for (x in parcel.data.1$county) {
  for (y in parcel.data.1$classification) {
    parcel.data.1$growth.factor <- ifelse(
      parcel.data.1$classification == y & parcel.data.1$county == x,
      1 + cat.growth[x, y],
      parcel.data.1$growth.factor)
  }
}
But when I run it, I can't even get it to complete (I gave up after 12 minutes). I've tried using all the cores on my Mac with:
library(foreach)
library(doSNOW)
c1 <- makeCluster(8, type = "SOCK")
registerDoSNOW(c1)
But that didn't help. I've looked at all the blogs and other posts regarding slow loops, but my loop body is only a single line, so I didn't see anything in those other suggestions that applied to making it faster.
Any help getting this loop to run in less than a minute would be extremely appreciated.
As others have pointed out, you shouldn't use a loop. But your question seems to be "Why is this loop taking so long?"
The answer is that you seem to be looping over all 1.32 million elements of parcel.data.1$county and all 1.32 million elements of parcel.data.1$classification. This means that your loop is evaluating ifelse() 1320000^2 times, not 352 times.
If you are going to use a loop, then loop over the unique elements of each column, which are given by the row names and column names of cat.growth.
for (x in rownames(cat.growth)) {    # loop over counties
  for (y in colnames(cat.growth)) {  # loop over classifications
    ...
  }
}
This loop is equivalent to your original script with 352 lines of code, so it should have roughly the same run time of ~16 seconds.
Note that if you didn't already know the unique elements of those two vectors, then you could use unique() to find them.
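That said, no loop is needed at all here. As a sketch of a fully vectorized alternative, R allows a matrix to be indexed by a two-column character matrix of (row name, column name) pairs; this assumes the county and classification values in parcel.data.1 exactly match the dimnames of cat.growth:
# Convert the lookup table to a matrix so it can be indexed by name pairs
m <- as.matrix(cat.growth)
parcel.data.1$growth.factor <-
  1 + m[cbind(as.character(parcel.data.1$county),
              as.character(parcel.data.1$classification))]
This performs all 1.32 million lookups in a single call and should run in well under a second.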
This looks like the reason joins were created, and the dplyr package is what you want.
I don't have your data, but based on your code, I have assembled some simple fake data that looks like it may be structured like yours.
df1 <- data.frame(x = c("Ag", "Ag", "Be", "Be", "Mo", "Mo"),
                  y = c("A", "B", "A", "B", "A", "B"))
df2 <- data.frame(x = c("Ag", "Be", "Mo"),
                  A = c(1, 2, 3),
                  B = c(4, 5, 6))
library(dplyr)
library(tidyr)
df1 %>%
inner_join(df2 %>% pivot_longer(cols = c(A, B), names_to = "y")) %>%
mutate(value = value + 1)
Joining, by = c("x", "y")
x y value
1 Ag A 2
2 Ag B 5
3 Be A 3
4 Be B 6
5 Mo A 4
6 Mo B 7
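Mapped onto the actual data, a sketch might look like this (my assumptions: cat.growth has counties as row names and classifications as column names, as the question's indexing suggests):
library(dplyr)
library(tidyr)
library(tibble)
# Reshape the 44 x 8 lookup table to long form: county, classification, value
cat.growth.long <- cat.growth %>%
  rownames_to_column("county") %>%
  pivot_longer(-county, names_to = "classification", values_to = "growth.value")
parcel.data.1 <- parcel.data.1 %>%
  left_join(cat.growth.long, by = c("county", "classification")) %>%
  mutate(growth.factor = 1 + growth.value)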
The loop is probably not the best way of doing this. One alternative may be to reshape the cat.growth data (44 x 8) to a data.frame with variables for county, classification and growth factor (i.e. 352 x 3) and then use "merge" on this and the original data frame.
To illustrate what I mean (based on what I understand your data looks like):
cat.growth <- as.data.frame(matrix(nrow = 44, ncol = 8,
                                   dimnames = list(1:44, letters[1:8]),
                                   data = rnorm(44 * 8)))
parcel.data <- data.frame(county = sample(1:44, 1e06, replace = TRUE),
                          classification = sample(letters[1:8], 1e06, replace = TRUE))
cat.growthL <- reshape(cat.growth, direction = "long",
                       idvar = "county",
                       ids = rownames(cat.growth),
                       varying = 1:8,
                       times = colnames(cat.growth),
                       timevar = "classification",
                       v.names = "growth.factor")
parcel.data2 <- merge(parcel.data, cat.growthL)
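One caveat, based on my reading of the 1 + cat.growth[...] in the question: the merged growth.factor column holds the raw table value, so the 1 still needs to be added afterwards:
# The question's growth factor was 1 + the table value
parcel.data2$growth.factor <- 1 + parcel.data2$growth.factor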
I am trying to merge two data frames, dat1 and dat2.
dat1 is an n x 65 data frame, whose rownames correspond to specific values in dat2. Here's a smaller, toy example:
year <- c(1998, 2001, 2002, 2004, 2008)
make <- c("Ford", "Dodge", "Toyota", "Tesla", "Jeep")
model <- c("Escape", "Ram", "Camry", "Model S", "Wrangler")
price <- c(4750.00, 14567.00, 20000.00, 60123.00, 18469.00)
dat1 <- data.frame(year, make, model, price)
dat2 is an n x 1 data frame with a single column called rownumber. It contains row numbers from dat1 which are of interest. Here's an example:
dat2 <- as.data.frame(c(12, 45, 56, 123))
colnames(dat2)[1] <- "rownumber"
I attempt to merge both data frames as follows:
dat3 <- merge(dat1, dat2, by = "row.names", all.x = TRUE, all.y = FALSE)
The problem is that dat3 contains two columns, row.names and rownumber, which do not match. For example, in row one of dat3, row.names = 1 but rownumber = 777.
Any thoughts on how to resolve this? Thanks in advance.
Say you have a data.frame object called dat1 as defined below:
> year <- c(1998, 2001, 2002, 2004, 2008)
> make <- c("Ford", "Dodge", "Toyota", "Tesla", "Jeep")
> model <- c("Escape", "Ram", "Camry", "Model S", "Wrangler")
> price <- c(4750.00, 14567.00, 20000.00, 60123.00, 18469.00)
> dat1 <- data.frame(year, make, model, price)
Say you have a vector rownumbers that defines rows of interest from dat1.
> rownumbers <- c(2,5)
Your question does not state the desired result, but assuming you want to subset the rows of dat1 whose row numbers are in the vector rownumbers, one simple way to do this is:
> dat1[rownumbers,] # Don't forget the comma
year make model price
2 2001 Dodge Ram 14567
5 2008 Jeep Wrangler 18469
Edit: you can assign your subset to a new variable dat3 if you'd like.
> dat3 <- dat1[rownumbers,]
Thanks, all, for the quick responses, and especially @aashanand for the elegant solution. Here's the quick way:
dat3 <- dat1[dat2$rownumber, ]
I want to calculate the mean and standard deviation, by group, for each column in a subset of a large data frame.
I'm trying to understand why some of the answers to similar questions aren't working for me; I'm still pretty new at R and I'm sure there are a lot of subtleties (and not-so-subtle things!) I'm completely missing.
I have a large data frame similar to this one:
mydata <- data.frame(
  Experiment = rep(c("E1", "E2", "E3", "E4"), each = 9),
  Treatment = c(rep(c("A", "B", "C"), each = 3),
                rep(c("A", "C", "D"), each = 3),
                rep(c("A", "D", "E"), each = 3),
                rep(c("A", "B", "D"), each = 3)),
  Day1 = sample(1:100, 36),
  Day2 = sample(1:100, 36),
  Day3 = sample(1:150, 36),
  Day4 = sample(50:150, 36))
I need to subset the data by Experiment and by Treatment, for example:
testB <- mydata[(mydata[, "Experiment"] %in% c("E1", "E4"))
& mydata[, "Treatment"] %in% c("A", "B"),
c("Treatment", "Day1", "Day2", "Day4")]
Then, for each column in testB, I want to calculate the mean and standard deviation for each Treatment group.
I started by trying tapply (over just one column to begin with), but I get back NA for Treatment groups that shouldn't be in testB, which isn't a big problem with this small dataset but is pretty irksome with my real data:
> tapply(testB$Day1, testB$Treatment, mean)
A B C D E
70.66667 61.00000 NA NA NA
I tried implementing solutions from Compute mean and standard deviation by group for multiple variables in a data.frame. Using aggregate worked:
ag <- aggregate(. ~ Treatment, testB, function(x) c(mean = mean(x), sd = sd(x)))
But I can't get the data.table solutions to work.
library(data.table)
testB[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by = Treatment]
testB[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]
both gave me the error message
Error in `[.data.frame`(testB, , c(mean = lapply(.SD, mean), sd = lapply(.SD, :
unused argument(s) (by = Treatment)
What am I doing wrong?
Thanks in advance for helping a clueless beginner!
Your Treatment column is a factor. Although you've dropped the rows that have the treatments "C", "D", and "E" in your subset testB, those levels still exist; use levels(testB$Treatment) to see them. You can use the droplevels function when defining your testB subset so that you get means for A and B without returning NAs for the empty factor levels.
testB <- droplevels(mydata[(mydata[, "Experiment"] %in% c("E1", "E4"))
                           & mydata[, "Treatment"] %in% c("A", "B"),
                           c("Treatment", "Day1", "Day2", "Day4")])
tapply(testB$Day1, testB$Treatment, mean)
A B
59.16667 66.00000
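As for the data.table error: the traceback shows [.data.frame, which suggests (my reading) that testB is still a plain data.frame, and the special by = syntax only works on data.table objects. Converting first should make your grouped calls work:
library(data.table)
testB_dt <- as.data.table(testB)  # the by = syntax needs a data.table
testB_dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = Treatment]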
Hope this helps!
Ron
You could use plyr and reshape2 to tackle this problem as well; I generally prefer to use these libraries because the abstractions they introduce apply to more problems, and are cleaner.
How I would solve it:
library(plyr)
library(reshape2)
# testB from your code above
# make a "long" version of testB
longTestB <- melt(testB, id.vars="Treatment")
# then use ddply to calculate the metrics per Treatment and per Day column
ddply(longTestB, .(Treatment, variable), summarize,
      mean = mean(value), stdev = sd(value))
(Grouping by variable as well keeps the per-day columns separate, matching the per-column output from aggregate above; grouping only by Treatment would pool all the Day columns together.)
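For completeness, a more current tidyverse sketch of the same idea (my addition, not part of the original answer; it assumes dplyr and tidyr are installed):
library(dplyr)
library(tidyr)
testB %>%
  pivot_longer(-Treatment, names_to = "Day") %>%
  group_by(Treatment, Day) %>%
  summarise(mean = mean(value), sd = sd(value), .groups = "drop")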