Ciao,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL, where TEST1TOTAL = TEST1A + TEST2A + TEST3A, and so on for TEST2TOTAL. If any score in TEST1A, TEST2A, or TEST3A is missing, then TEST1TOTAL should be NA.
Here is my attempt, but is there a solution with fewer lines of code? As written I would need to repeat this line many times, since the test letters run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just base R functions (df1 below is the data with columns STUDENT through TEST3B only, i.e. without the TOTAL columns):
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
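The same idea extends to tests lettered up to O by building the patterns programmatically. A minimal sketch (assuming, as above, that df1 holds only STUDENT plus per-test columns named TEST<number><letter>, and that the letters in alphabetical order correspond to TOTAL 1, 2, ...):
# Find the test letters actually present, then sum the columns for each letter.
# rowSums() is used without na.rm, so any NA component makes the total NA.
test_letters <- sort(unique(sub("^TEST[0-9]+", "",
                                grep("^TEST[0-9]+[A-O]$", names(df1), value = TRUE))))
totals <- sapply(test_letters, function(l)
  rowSums(df1[, grep(paste0("^TEST[0-9]+", l, "$"), names(df1)), drop = FALSE]))
colnames(totals) <- paste0("TEST", seq_along(test_letters), "TOTAL")
output <- data.frame(df1, totals)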
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness' sake (and because I learned something figuring it out), here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is that you don't need to specify the test identifiers (beyond the fact that they're numbered and carry a trailing letter), and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract the test number from the variable name
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA
I have monthly data for 2019 and 2020, and only two months of data for 2021 (Jan and Feb). I want to make a vector of these 26 values for use as a time series.
my_dat <- data.frame(X2021 = c(1:2,rep(NA,10)), X2020 = 1:12, X2019 = 1:12)
library(dplyr)
X2021 <- my_dat %>% pull(X2021)
X2021 <- X2021[ -(3:12) ]
x <- my_dat %>% pull(X2019,X2020)
c(x, X2021)
##1 2 3 4 5 6 7 8 9 10 11 12
##1 2 3 4 5 6 7 8 9 10 11 12 1 2
I expected:
c(1:12, 1:12, 1:2)
What went wrong?
Since pull is equivalent to $ in base R, and can only be used for extracting one variable, I think you want select and then unlist. E.g.:
my_dat %>% select(X2019, X2020) %>% unlist(use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
Which would be equivalent to using the square brackets [] in base R:
unlist(my_dat[c("X2019","X2020")], use.names=FALSE)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
As to why the original code didn't work, ?pull shows the syntax is:
pull(.data, var, name)
So
my_dat %>% pull(X2019,X2020)
is just pulling / extracting X2019 and naming it with X2020. To give a clearer example:
dat <- data.frame(a=1:3, b=month.abb[1:3])
pull(dat, a, b)
#Jan Feb Mar
# 1 2 3
unname(pull(dat, a, b))
#[1] 1 2 3
names(pull(dat, a, b))
#[1] "Jan" "Feb" "Mar"
I have a data frame as follows:
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6
4 4 25 7
5 5 3 7
6 6 34 7
For each row with a duplicated sample_id, I would like to add a unique identifier, as follows:
index val sample_id
1 1 14 5
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
Any suggestions? Thank you for your help.
Base R
dat$id2 <- ave(dat$sample_id, dat$sample_id,
FUN = function(z) if (length(z) > 1) paste(z, LETTERS[seq_along(z)], sep = "-") else as.character(z))
dat
# index val sample_id id2
# 1 1 14 5 5
# 2 2 22 6 6-A
# 3 3 1 6 6-B
# 4 4 25 7 7-A
# 5 5 3 7 7-B
# 6 6 34 7 7-C
tidyverse
library(dplyr)
dat %>%
group_by(sample_id) %>%
mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-") else as.character(sample_id)) %>%
ungroup()
Minor note: it might be tempting to drop the as.character(z) from either or both of the code blocks. In the base R version, nothing changes (here): base R lets you be a little sloppy, but if we rely on that and need the new field to always be character, then in the rare case where every row has a unique sample_id the column will remain integer. dplyr is much more careful about guarding against this: if you run the tidyverse code without as.character, you'll see an error.
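For illustration, a minimal sketch of that failure mode (using dat as above): without as.character, the groups with duplicated ids return character while the single-row group returns the original numeric type, and mutate() refuses to combine them.
library(dplyr)
dat %>%
  group_by(sample_id) %>%
  mutate(id2 = if (n() > 1) paste(sample_id, LETTERS[row_number()], sep = "-")
               else sample_id) %>%   # no as.character(): types now differ across groups
  ungroup()
# Errors: the duplicated groups yield character values, the sample_id 5 group
# yields a number, and the two cannot be combined into one column.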
Using dplyr:
library(dplyr)
dplyr::group_by(df, sample_id) %>%
dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
If you just want to create unique tags for the same sample_id, maybe you can try make.unique like below
transform(
df,
sample_id = ave(as.character(sample_id),sample_id,FUN = function(x) make.unique(x,sep = "_"))
)
which gives
index val sample_id
1 1 14 5
2 2 22 6
3 3 1 6_1
4 4 25 7
5 5 3 7_1
6 6 34 7_2
I am trying to add additional data from a reference table onto my primary data frame. I see similar questions have been asked about this, however I can't find anything for my specific case.
An example of my data frame is set up like this:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
print(df)
participant time
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
13 1 5
14 2 5
15 3 5
16 1 6
17 2 6
18 3 6
19 1 7
20 2 7
21 3 7
22 1 8
23 2 8
24 3 8
25 1 9
26 2 9
27 3 9
> print(lookup)
start.time end.time var1 var2 var3
1 1 3 A 8 fast
2 5 6 B 12 fast
3 8 10 A 3 slow
What I want to do is merge or join these two data frames in a way that also covers the times between the start and end times in the lookup data frame. In other words, the columns var1, var2 and var3 should be added onto df wherever the time lies between the start time and end time.
For example, the first row of the lookup has a start time of 1 and an end time of 3, so for times 1, 2 and 3 for each participant, that row's data should be added.
The output should look something like this:
print(output)
participant time var1 var2 var3
1 1 1 A 8 fast
2 2 1 A 8 fast
3 3 1 A 8 fast
4 1 2 A 8 fast
5 2 2 A 8 fast
6 3 2 A 8 fast
7 1 3 A 8 fast
8 2 3 A 8 fast
9 3 3 A 8 fast
10 1 4 <NA> NA <NA>
11 2 4 <NA> NA <NA>
12 3 4 <NA> NA <NA>
13 1 5 B 12 fast
14 2 5 B 12 fast
15 3 5 B 12 fast
16 1 6 B 12 fast
17 2 6 B 12 fast
18 3 6 B 12 fast
19 1 7 <NA> NA <NA>
20 2 7 <NA> NA <NA>
21 3 7 <NA> NA <NA>
22 1 8 A 3 slow
23 2 8 A 3 slow
24 3 8 A 3 slow
25 1 9 A 3 slow
26 2 9 A 3 slow
27 3 9 A 3 slow
I realise that column names don't match and they should for merging data sets.
One option would be to use the sqldf package, and phrase your problem as a SQL left join:
sql <- "SELECT t1.participant, t1.time, t2.var1, t2.var2, t2.var3
FROM df t1
LEFT JOIN lookup t2
ON t1.time BETWEEN t2.\"start.time\" AND t2.\"end.time\""
output <- sqldf(sql)
A dplyr solution:
library(dplyr)

output <- df %>%
# Create an id for the join
mutate(merge_id=1) %>%
# Use full join to create all the combinations between the two datasets
full_join(lookup %>% mutate(merge_id=1), by="merge_id") %>%
# Keep only the rows that we want
filter(time >= start.time, time <= end.time) %>%
# Select the relevant variables
select(participant,time,var1:var3) %>%
# Right join with initial dataset to get the missing rows
right_join(df, by = c("participant","time")) %>%
# Sort to match the formatting asked by OP
arrange(time, participant)
This produces the output asked by OP, but it will only work for data of reasonable size, as the full join produces a data frame with number of rows equal to the product of the number of rows of both initial datasets.
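As a side note, if you have dplyr 1.1.0 or later, join_by() supports non-equi joins directly, which avoids materializing the full cross product. A minimal sketch:
library(dplyr)   # requires dplyr >= 1.1.0 for join_by()
output <- df %>%
  left_join(lookup, by = join_by(between(time, start.time, end.time))) %>%
  select(participant, time, var1:var3) %>%
  arrange(time, participant)
Rows of df with no matching interval are kept and filled with NA, as in the desired output.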
Using tidyverse and creating an auxiliary table:
df <- data.frame("participant" = rep(1:3,9), "time" = rep(1:9, each = 3))
lookup <- data.frame("start.time" = c(1,5,8), "end.time" = c(3,6,10), "var1" = c("A","B","A"),
"var2" = c(8,12,3), "var3"= c("fast","fast","slow"))
lookup_extended <- lookup %>%
mutate(time = map2(start.time, end.time, ~ c(.x:.y))) %>%
unnest(time) %>%
select(-start.time, -end.time)
df2 <- df %>%
left_join(lookup_extended, by = "time")
I have coordinates for each site and the year each site was sampled (fake dataframe below).
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
> dfA
LAT LONG YEAR
1 1 11 2001
2 2 12 2002
3 3 13 2003
4 4 14 2004
5 5 15 2005
6 1 11 2006
7 2 12 2007
8 3 13 2008
9 4 14 2009
10 5 15 2010
11 1 11 2011
12 2 12 2012
13 3 13 2013
14 4 14 2014
15 5 15 2015
16 1 16 2016
17 2 17 2017
18 3 18 2018
19 4 19 2019
20 5 20 2020
I'm trying to pull out the years in which each unique location was sampled. So I first pulled out each unique location and the number of times it was sampled using the following code:
library(dplyr)

dfB <- dfA %>%
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB)
LAT LONG Freq
1 1 11 3
2 1 16 1
3 2 12 3
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
I am now trying to get the year for each unique location. I.e. I ultimately want this:
LAT LONG Freq Year
1 1 11 3 2001,2006,2011
2 1 16 1 2016
3 2 12 3 2002,2007,2012
4 2 17 1
5 3 13 3
6 3 18 1
7 4 14 3
8 4 19 1
9 5 15 3
10 5 20 1
This is what I've tried:
1) Find which rows in dfA correspond to rows in dfB:
dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))
> idx
[1] 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 2 4 6 8 10
So idx[1] means dfA[1,] matches dfB[1,], and dfA[6,] and dfA[11,] also match dfB[1,].
I've tried this to extract info:
for (row in 1:20){
year<-as.character(dfA$YEAR[row])
tmp<-dfB$obs_Year[idx[row]]
if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-year
}
if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
}
}
I keep getting this error:
number of items to replace is not a multiple of replacement length
Does anyone know how to extract the years for matching pairs between dfA and dfB? I don't know if this is the most efficient approach, but this is as far as I've gotten. Thanks in advance!
You can do this with a dplyr chain that first builds your year column and then filters down to only unique observations.
The logic is to group your data by location and paste all the years for a given location into a single string variable, which we call year_string. We also compute the frequency, but this is not strictly necessary.
The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see values repeated for locations. So we exclude the YEAR column and then ask R to return unique() values of the data.frame to us. It will pick one of the observations per location where multiple occur, but since they are identical that doesn't matter.
Code below:
library(dplyr)
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
# We assign the output to dfB
dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
mutate( # The mutate verb is for building new variables.
year_string = paste(YEAR, collapse = ","), # the function paste()
# collapses the vector YEAR into a string
# the argument collapse = "," says to
# separate each element of the string with a comma
Freq = n()) %>% # I compute the frequency as you did
select(LAT, LONG, Freq, year_string) %>%
# Now I select only the columns that index
# location, frequency and the combined years
unique() # Now I filter for only unique observations. Since I have not picked
# YEAR in the select function only unique locations are retained
dfB
#> # A tibble: 10 x 4
#> # Groups: LAT, LONG [10]
#> LAT LONG Freq year_string
#> <int> <int> <int> <chr>
#> 1 1 11 3 2001,2006,2011
#> 2 2 12 3 2002,2007,2012
#> 3 3 13 3 2003,2008,2013
#> 4 4 14 3 2004,2009,2014
#> 5 5 15 3 2005,2010,2015
#> 6 1 16 1 2016
#> 7 2 17 1 2017
#> 8 3 18 1 2018
#> 9 4 19 1 2019
#> 10 5 20 1 2020
Created on 2019-01-21 by the reprex package (v0.2.1)
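A slightly more compact variant of the same idea is to let summarise() collapse each location to a single row, so the unique() step isn't needed (the .groups argument needs dplyr 1.0 or later):
library(dplyr)
dfB <- dfA %>%
  group_by(LAT, LONG) %>%
  summarise(Freq = n(),
            year_string = paste(YEAR, collapse = ","),   # all years for this location
            .groups = "drop")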
I have two dataframes of this format.
df1:
id x y
1 2 3
2 4 5
3 6 7
4 8 9
5 1 1
df2:
id id2 v v2
1 t 11 21
1 b 12 22
2 t 13 23
2 b 14 24
3 t 15 25
3 b 16 26
4 b 17 27
Hence, an id from df1 may appear in df2 twice (at most), once, or not at all. The expected result would be:
df_merged:
id x y v.t v2.t v.b v2.b
1 2 3 11 21 12 22
2 4 5 13 23 14 24
3 6 7 15 25 16 26
4 8 9 NA NA 17 27
5 1 1 NA NA NA NA
I have used merge, but because id2 in df2 has no counterpart in df1, I end up with two rows per id in df_merged, like so:
id x y v v2
1 ...
1 ...
Thanks in advance!
We can start by adjusting df2 to the right format, then do a normal join.
library(dplyr)
library(tidyr)
df2 %>% gather(key,val,-id,-id2) %>% #Convert v and v2 from wide to long format
mutate(new_key=paste0(key,'.',id2)) %>% #Create a new key combining the measure name and id2
select(-id2,-key) %>% #Drop the columns that are no longer needed
spread(new_key,val) %>% #Convert back to wide format, one column per new_key
right_join(df1) %>% #Right join with df1 to keep all rows of df1, joining by id
select(id,x,y,v.t,v2.t,v.b,v2.b) #Reorder the columns
Joining, by = "id"
id x y v.t v2.t v.b v2.b
1 1 2 3 11 21 12 22
2 2 4 5 13 23 14 24
3 3 6 7 15 25 16 26
4 4 8 9 NA NA 17 27
5 5 1 1 NA NA NA NA
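Note that gather() and spread() have since been superseded; with tidyr 1.0 or later the same reshaping can be written with pivot_wider(), roughly like this:
library(dplyr)
library(tidyr)
wide <- df2 %>%
  pivot_wider(names_from = id2, values_from = c(v, v2),
              names_glue = "{.value}.{id2}")   # columns such as v.t, v.b, v2.t, v2.b
left_join(df1, wide, by = "id")   # keeps id 5 from df1, filled with NA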
You can solve this just using merge. Split df2 based on whether id2 equals b or t. Merge these two new objects with df1, and finally merge them together. The code includes one additional step to also include data found in df1 but not df2.
dfb <- merge(df1, df2[df2$id2=='b',], by='id')
dft <- merge(df1, df2[df2$id2=='t',], by='id')
dfRest <- df1[!df1$id %in% df2$id,]
dfAll <- merge(dfb[,c('id','x','y','v','v2')], dft[,c('id','v','v2')], by='id', all.x=T)
merge(dfAll, dfRest, all.x=T, all.y=T)
id x y v.x v2.x v.y v2.y
1 1 2 3 12 22 11 21
2 2 4 5 14 24 13 23
3 3 6 7 16 26 15 25
4 4 8 9 17 27 NA NA
5 5 1 1 NA NA NA NA