Replace column values based on a column in another dataframe - r

I would like to replace some column values in a dataframe based on a column in another dataframe.
This is the head of the first df:
df1
# A tibble: 253 x 2
id sum_correct
<int> <dbl>
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 16
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92
Some of the sum_correct values need to be replaced by the correct values from another dataframe, using the id to trigger the replacement.
df2
# A tibble: 14 x 2
id sum_correct
<int> <dbl>
1 866103 61
2 866124 79
3 866152 85
4 867101 24
5 867140 76
6 867146 51
7 867152 56
8 867200 50
9 867209 97
10 879657 56
11 879680 61
12 879683 58
13 879693 77
14 881451 57
How can I achieve this in RStudio? Thanks for the help in advance.

You can make an update join: use match to find where id matches, and drop the non-matches (NA) with which:
idx <- match(df1$id, df2$id)    # position of each df1 id in df2 (NA if no match)
idxn <- which(!is.na(idx))      # rows of df1 that have a match in df2
df1$sum_correct[idxn] <- df2$sum_correct[idx[idxn]]  # overwrite with df2 values
df1
id sum_correct
1 866093 77
2 866097 95
3 866101 37
4 866102 65
5 866103 61
6 866104 72
7 866105 99
8 866106 90
9 866108 74
10 866109 92

You can do a left_join and then use coalesce:
library(dplyr)
left_join(df1, df2, by = "id", suffix = c("_1", "_2")) %>%
  mutate(sum_correct_final = coalesce(sum_correct_2, sum_correct_1))
The new column sum_correct_final contains the value from df2 where a matching entry exists, and the value from df1 otherwise.
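As a further option, not shown in the original answers, data.table supports an in-place update join; a minimal sketch, assuming both tibbles are converted to data.tables first:
library(data.table)
setDT(df1)  # convert to data.table by reference
setDT(df2)
# update join: for ids present in both tables, overwrite df1$sum_correct in place;
# ids in df2 with no match in df1 are simply ignored
df1[df2, on = "id", sum_correct := i.sum_correct]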

Related

How do I regroup data?

I am looking to change the structure of my dataframe, but I am not really sure how to do it, or even how to word the question.
ID <- c(1,8,6,2,4)
a <- c(111,94,85,76,72)
b <- c(75,37,86,55,62)
dataframe <- data.frame(ID,a,b)
ID a b
1 1 111 75
2 8 94 37
3 6 85 86
4 2 76 55
5 4 72 62
Above is the code with the output. I want the output to look like the following; however, the only way I know how to do this is to type it manually. Is there any way other than changing the input manually? I have quite a large data set that I would like to change, and doing it manually would take forever.
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
We may use pivot_longer:
library(dplyr)
library(tidyr)
dataframe %>%
  pivot_longer(cols = a:b, names_to = 'letter')
Output
# A tibble: 10 × 3
ID letter value
<dbl> <chr> <dbl>
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
A base R option using reshape:
df <- reshape(dataframe, direction = "long",
              v.names = "value",
              varying = 2:3,
              times = names(dataframe)[2:3],
              timevar = "letter",
              idvar = "ID")
df <- df[order(match(df$ID, dataframe$ID)), ]
row.names(df) <- NULL
Output
ID letter value
1 1 a 111
2 1 b 75
3 8 a 94
4 8 b 37
5 6 a 85
6 6 b 86
7 2 a 76
8 2 b 55
9 4 a 72
10 4 b 62
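If you prefer data.table, melt gives the same long format; a sketch, assuming the data.table package is installed:
library(data.table)
# one row per (ID, letter) combination; letter comes back as a factor by default
melt(as.data.table(dataframe), id.vars = "ID",
     variable.name = "letter", value.name = "value")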

Populate new column row-by-row using values from different existing columns, using date as selection criteria

I am using R to edit a csv of GPS points. The table looks kind of like this:
ID DATE 2002.08.01 2002.08.02 2002.08.03 2002.08.04
1 8/1/2002 56 41 54 89
2 8/2/2002 65 59 69 10
3 8/2/2002 66 51 61 5
4 8/3/2002 11 21 12 32
Each column in the table above that has a date as its header contains the snow depth for that specific day at that GPS point. What I want is a new column SNOW_DEPTH that only has the snow depth for the correct date for that GPS point. In the example data I gave, the solution I am looking for is this:
ID DATE SNOW_DEPTH
1 8/1/2002 56
2 8/2/2002 59
3 8/2/2002 51
4 8/3/2002 12
Notice that the values for SNOW_DEPTH in the solution table are populated from the snow depth values in that row, but the column used depends on the date.
I do not want to list each column by name, as in my real data there are thousands of columns (all with dates as column headers). Is there a simple, script-based R solution to my dilemma?
Here's a solution using the tidyverse suite of packages. Note that I'm assuming that DATE is stored as a character or factor.
library(tidyverse)

df <- read_table("ID DATE 2002.08.01 2002.08.02 2002.08.03 2002.08.04
1 8/1/2002 56 41 54 89
2 8/2/2002 65 59 69 10
3 8/2/2002 66 51 61 5
4 8/3/2002 11 21 12 32")
df %>%
  gather(COL_DATE, SNOW_DEPTH, -ID, -DATE) %>%
  # convert both DATE and COL_DATE to the Date class; if DATE is already a Date,
  # skip its conversion (you still need to convert COL_DATE)
  mutate(
    DATE = as.Date(DATE, format = "%m/%d/%Y"),
    COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
  ) %>%
  filter(DATE == COL_DATE) %>%
  select(-COL_DATE)
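Note that gather has since been superseded; with current tidyr the same reshape can be sketched with pivot_longer, under the same assumptions about DATE:
df %>%
  pivot_longer(-c(ID, DATE), names_to = "COL_DATE", values_to = "SNOW_DEPTH") %>%
  mutate(
    DATE = as.Date(DATE, format = "%m/%d/%Y"),
    COL_DATE = as.Date(COL_DATE, format = "%Y.%m.%d")
  ) %>%
  filter(DATE == COL_DATE) %>%
  select(-COL_DATE)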
I think your best bet is to make a 'long' file with ID/date/value running down the page and then just merge this back to your initial data matching on ID and date:
merge(
transform(dat[1:2], ind=format(as.Date(DATE, format="%m/%d/%Y"), "%Y.%m.%d")),
cbind(dat["ID"], stack(dat[-(1:2)]))
)
# ID ind DATE values
#1 1 2002.08.01 8/1/2002 56
#2 2 2002.08.02 8/2/2002 59
#3 3 2002.08.02 8/2/2002 51
#4 4 2002.08.03 8/3/2002 12
cbind(dat["ID"], stack(dat[-(1:2)])) gives the long file:
# ID values ind
#1 1 56 2002.08.01
#2 2 65 2002.08.01
#3 3 66 2002.08.01
# <snip>
And transform(dat[1:2], ind=format(as.Date(DATE, format="%m/%d/%Y"), "%Y.%m.%d")) gives the correctly formatted date to merge back on:
# ID DATE ind
#1 1 8/1/2002 2002.08.01
#2 2 8/2/2002 2002.08.02
#3 3 8/2/2002 2002.08.02
#4 4 8/3/2002 2002.08.03
Where dat for this example was:
dat <- read.table(text="ID DATE 2002.08.01 2002.08.02 2002.08.03 2002.08.04
1 8/1/2002 56 41 54 89
2 8/2/2002 65 59 69 10
3 8/2/2002 66 51 61 5
4 8/3/2002 11 21 12 32", header=TRUE, stringsAsFactors=FALSE)

All variables not read from data pipeline - dplyr

I have a dataset with 8 variables. When I run the dplyr code below, my output dataframe only has the variables I used in the dplyr code, but I want all the variables.
ShowID <- MyData %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  filter(count == min(count))
ShowID
So my output has only two variables: id and count. How do I get the rest of my variables into the new dataframe? Why is this happening, and what am I missing here?
> ncol(ShowID)
[1] 2
> ncol(MyData)
[1] 8
MYDATA
key ID v1 v2 v3 v4 v5 v6
0-0-70cf97 1 89 20 30 45 55 65
3ad4893b8c 1 4 5 45 45 55 65
0-0-70cf97d7 2 848 20 52 66 56 56
0-0-70cf 2 54 4 846 65 5 5
0-0-793b8c 3 56454 28 6 4 5 65
0-0-70cf98 2 8 4654 30 65 6 21
3ad4893b8c 2 89 66 518 156 16 65
0-0-70cf97d8 3 89 20 161 1 55 45465
0-0-70cf 5 89 79 48 45 55 456
0-0-793b8c 5 89 20 48 545 654 4
0-0-70cf99 6 9 20 30 45 55 65
DESIRED
key ID count v1 v2 v3 v4 v5 v6
0-0-70cf99 6 1 9 20 30 45 55 65
RESULT FROM CODE
ID count
6 1
You can use the base R ave function to calculate the number of rows in each group (ID) and then select the groups that have the minimum number of rows.
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
MyData[which(num_rows == min(num_rows)), ]
# key ID v1 v2 v3 v4 v5 v6
#11 0-0-70cf99 6 9 20 30 45 55 65
You could also use which.min in this case to avoid one step; however, it would fail if there were multiple minimum values, hence I have used which.
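For illustration, the which.min variant would look like this; a sketch only, since it returns a single row even when several groups tie for the minimum:
num_rows <- ave(MyData$v1, MyData$ID, FUN = length)
MyData[which.min(num_rows), ]  # only the first row at the minimum count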
No need to summarize:
ShowID <- MyData %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  filter(count == min(count))
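A more compact variant of the same idea, a sketch using dplyr's add_count, which attaches the per-group count without grouping the result:
ShowID <- MyData %>%
  add_count(id, name = "count") %>%  # per-id row count as a new column
  filter(count == min(count))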

Transpose data in R by ID number

I have a question about transposing data in R. Basically, I am looking for an R alternative to SAS's proc transpose by id prefix = test and proc transpose by id prefix = score.
So I have a set of data looks like the following
ID test date score
1 4/1/2001 98
1 5/9/2001 65
1 5/23/2001 85
2 3/21/2001 76
2 4/8/2001 58
2 5/22/2001 67
2 6/15/2001 53
3 1/15/2001 46
3 5/30/2001 55
4 1/8/2001 71
4 2/14/2001 95
4 7/15/2001 93
and I would love to transpose it into:
id testdate1 score1 testdate2 score2 testdate3 score3 testdate4 score4
1 4/1/2001 98 5/9/2001 65 5/23/2001 85 . .
2 3/21/2001 76 4/8/2001 58 5/22/2001 67 6/15/2001 53
3 1/15/2001 46 5/30/2001 55 . . . .
4 1/8/2001 71 2/14/2001 95 7/15/2001 93 . .
This is a basic "long" to "wide" reshaping task. In base R, you can use reshape, but only after adding a "time" variable, like this:
mydf$time <- with(mydf, ave(ID, ID, FUN = seq_along))
reshape(mydf, direction = "wide", idvar = "ID", timevar = "time")
# ID test.date.1 score.1 test.date.2 score.2 test.date.3 score.3
# 1 1 4/1/2001 98 5/9/2001 65 5/23/2001 85
# 4 2 3/21/2001 76 4/8/2001 58 5/22/2001 67
# 8 3 1/15/2001 46 5/30/2001 55 <NA> NA
# 10 4 1/8/2001 71 2/14/2001 95 7/15/2001 93
# test.date.4 score.4
# 1 <NA> NA
# 4 6/15/2001 53
# 8 <NA> NA
# 10 <NA> NA
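With tidyr you could sketch the same wide reshape using pivot_wider, assuming the columns are named test.date and score as in the output above:
library(dplyr)
library(tidyr)
mydf %>%
  group_by(ID) %>%
  mutate(time = row_number()) %>%  # within-ID sequence, like the ave() step above
  ungroup() %>%
  pivot_wider(names_from = time, values_from = c(test.date, score))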

Merging time series and cross-section time series dataframes

I am having an issue merging two dataframes in R: one is a time series cross-section (i.e. a panel) and the other is a simple time series. Suppose I have two dataframes, df1 and df2, that I would like to merge. The panel dataframe df1 is given by
id year var1
1 80 3
1 81 5
1 82 7
1 83 9
2 80 5
2 81 5
2 82 7
2 83 5
3 80 9
3 81 9
3 82 7
3 83 3
while the time series dataframe df2 is given by
year var2
80 10
81 15
82 17
83 19
I would like to merge df1 and df2 into a third dataframe df, while preserving the time series cross-section row ordering of df1. However, when I use the command
df <- merge(df1, df2, by="year")
the new dataframe clusters the observations by year:
year id var1 var2
80 1 3 10
80 2 5 10
80 3 9 10
81 1 5 15
81 2 5 15
81 3 9 15
82 1 7 17
82 2 7 17
82 3 7 17
83 1 9 19
83 2 5 19
83 3 3 19
Does anyone know how I can make the row ordering in df the same as in df1? Thanks in advance!
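No answer is shown here, but as a sketch (not from the original thread), two common fixes: look the years up with match in base R, or use dplyr's left_join, which keeps the row order of its first argument:
# base R: add var2 to df1 without changing df1's row order
df1$var2 <- df2$var2[match(df1$year, df2$year)]

# dplyr: left_join preserves the row order of df1
library(dplyr)
df <- left_join(df1, df2, by = "year")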
