So I have a dataframe called "myData"
print(myData)
ID Name Status AGE
123 Mike Yes 18
124 John No 20
125 Lily Yes 21
126 Jasper No 24
127 Toby Yes 27
128 Will No 19
129 Oscar Yes 32
I received an updated dataframe that has updated "Status" called "myData2".
This dataframe has fewer observations than my original one and contains only ID and Status.
This is the updated dataframe
print(myData2)
ID Status
123 Yes
125 Yes
126 Yes
128 No
129 No
Is there a function I can use to update the 'Status' column in myData with the data in myData2, matching on the column "ID"?
This is my desired output
ID Name Status AGE
123 Mike Yes 18
124 John No 20
125 Lily Yes 21
126 Jasper Yes 24
127 Toby Yes 27
128 Will No 19
129 Oscar No 32
We can use data.table join to quickly update the first dataset 'Status' with the values of second after joining on 'ID'
library(data.table)
setDT(myData)[myData2, Status := i.Status, on = .(ID)]
myData
# ID Name Status AGE
#1: 123 Mike Yes 18
#2: 124 John No 20
#3: 125 Lily Yes 21
#4: 126 Jasper Yes 24
#5: 127 Toby Yes 27
#6: 128 Will No 19
#7: 129 Oscar No 32
In dplyr, we do a left_join and then coalesce the 'Status' columns
library(dplyr)
myData %>%
left_join(myData2, by = 'ID') %>%
mutate(Status = coalesce(Status.y, Status.x)) %>%
select(-Status.x, -Status.y)
data
myData <- structure(list(ID = 123:129, Name = c("Mike", "John", "Lily",
"Jasper", "Toby", "Will", "Oscar"), Status = c("Yes", "No", "Yes",
"No", "Yes", "No", "Yes"), AGE = c(18L, 20L, 21L, 24L, 27L, 19L,
32L)), class = "data.frame", row.names = c(NA, -7L))
myData2 <- structure(list(ID = c(123L, 125L, 126L, 128L, 129L), Status = c("Yes",
"Yes", "Yes", "No", "No")), class = "data.frame", row.names = c(NA,
-5L))
Here is a base R solution using merge, i.e.,
myData$Status <- with(merge(myData,myData2,by = "ID",all.x = TRUE),
ifelse(is.na(Status.y),Status.x,Status.y))
such that
> myData
ID Name Status AGE
1 123 Mike Yes 18
2 124 John No 20
3 125 Lily Yes 21
4 126 Jasper Yes 24
5 127 Toby Yes 27
6 128 Will No 19
7 129 Oscar No 32
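For comparison, the same update can be sketched without a merge by looking up each ID with match(). This is only an illustration built on the example data above; the helper names idx and has_update are made up for the sketch.

```r
# Rebuild the example data from the question (base R only).
myData <- data.frame(
  ID = 123:129,
  Name = c("Mike", "John", "Lily", "Jasper", "Toby", "Will", "Oscar"),
  Status = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes"),
  AGE = c(18L, 20L, 21L, 24L, 27L, 19L, 32L),
  stringsAsFactors = FALSE
)
myData2 <- data.frame(
  ID = c(123L, 125L, 126L, 128L, 129L),
  Status = c("Yes", "Yes", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# For every row of myData, find the position of its ID in myData2 (NA if absent).
idx <- match(myData$ID, myData2$ID)
has_update <- !is.na(idx)

# Overwrite Status only where myData2 supplies a value; other rows keep theirs.
myData$Status[has_update] <- myData2$Status[idx[has_update]]
myData
```

Unlike merge, this does not reorder rows, so it does not rely on myData already being sorted by ID.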
I'm trying to calculate percent change in R, where the time point is part of each column label (table below). I have dplyr loaded, and my dataset is loaded in R under the name data. The code I'm using is below, but it's not calculating correctly. I want to create a new dataframe called data_per_chg that contains, for each variable, the percent change from its "v1" column. For instance, for the wbc variable, I would like to calculate the percent change of wbc.v1 from wbc.v1, wbc.v2 from wbc.v1, wbc.v3 from wbc.v1, etc., and do that for all the remaining variables in my dataset. I assume I could use a loop to do this, but I'm pretty new to R, so I'm not sure how to proceed. Any guidance will be greatly appreciated.
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3
a1 23 63 30 23 56 90 13 89 47
a2 81 45 46 N/A 18 78 14 45 22
a3 NA 27 14 29 67 46 37 34 33
data_per_chg<-data%>%
group_by(id)%>%
arrange(id)%>%
mutate(change=(wbc.v2-wbc.v1)/(wbc.v1))
data_per_chg
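Assuming the "N/A" string in rbc.v1 is meant to be a missing value, convert it to a real NA with na_if, re-detect the column types with type.convert, and then compute the change of every non-v1 column from its matching v1 column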
library(dplyr)
library(stringr)
data <- data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(-c(id, matches("\\.v1$")), ~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change"))
-output
data
id wbc.v1 wbc.v2 wbc.v3 rbc.v1 rbc.v2 rbc.v3 hct.v1 hct.v2 hct.v3 wbc.v2_change wbc.v3_change rbc.v2_change rbc.v3_change hct.v2_change hct.v3_change
1 a1 23 63 30 23 56 90 13 89 47 1.7391304 0.3043478 1.434783 2.9130435 5.84615385 2.6153846
2 a2 81 45 46 NA 18 78 14 45 22 -0.4444444 -0.4320988 NA NA 2.21428571 0.5714286
3 a3 NA 27 14 29 67 46 37 34 33 NA NA 1.310345 0.5862069 -0.08108108 -0.1081081
If we want to keep the 'v1' columns as well
data %>%
na_if("N/A") %>%
type.convert(as.is = TRUE) %>%
mutate(across(ends_with('.v1'), ~ .x - .x,
.names = "{str_replace(.col, 'v1', 'v1change')}")) %>%
transmute(id, across(ends_with('change')),
across(-c(id, matches("\\.v1$"), ends_with('change')),
~ {
v1 <- get(str_replace(cur_column(), "v\\d+$", "v1"))
(.x - v1)/v1}, .names = "{.col}_change")) %>%
select(id, starts_with('wbc'), starts_with('rbc'), starts_with('hct'))
-output
id wbc.v1change wbc.v2_change wbc.v3_change rbc.v1change rbc.v2_change rbc.v3_change hct.v1change hct.v2_change hct.v3_change
1 a1 0 1.7391304 0.3043478 0 1.434783 2.9130435 0 5.84615385 2.6153846
2 a2 0 -0.4444444 -0.4320988 NA NA NA 0 2.21428571 0.5714286
3 a3 NA NA NA 0 1.310345 0.5862069 0 -0.08108108 -0.1081081
data
data <- structure(list(id = c("a1", "a2", "a3"), wbc.v1 = c(23L, 81L,
NA), wbc.v2 = c(63L, 45L, 27L), wbc.v3 = c(30L, 46L, 14L), rbc.v1 = c("23",
"N/A", "29"), rbc.v2 = c(56L, 18L, 67L), rbc.v3 = c(90L, 78L,
46L), hct.v1 = c(13L, 14L, 37L), hct.v2 = c(89L, 45L, 34L), hct.v3 = c(47L,
22L, 33L)), class = "data.frame", row.names = c(NA, -3L))
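Since the question asks about a loop, here is a minimal base R sketch of the same calculation. It assumes the column names follow the prefix.vN pattern shown above and uses a trimmed copy of the example data; the names baseline_cols, visit_cols and prefix are made up for the illustration. Like the dplyr code, it returns a fractional change, so multiply by 100 if a percentage is wanted.

```r
# Trimmed copy of the example data (wbc and rbc only), base R only.
data <- data.frame(
  id = c("a1", "a2", "a3"),
  wbc.v1 = c(23, 81, NA), wbc.v2 = c(63, 45, 27), wbc.v3 = c(30, 46, 14),
  rbc.v1 = c("23", "N/A", "29"), rbc.v2 = c(56, 18, 67), rbc.v3 = c(90, 78, 46),
  stringsAsFactors = FALSE
)

# Turn the "N/A" string into a real NA so the column can be numeric.
data$rbc.v1[data$rbc.v1 == "N/A"] <- NA
data$rbc.v1 <- as.numeric(data$rbc.v1)

# Every column ending in ".v1" is treated as a baseline.
baseline_cols <- grep("\\.v1$", names(data), value = TRUE)

for (base_col in baseline_cols) {
  prefix <- sub("\\.v1$", "", base_col)
  # All visits for this variable, e.g. wbc.v1, wbc.v2, wbc.v3.
  visit_cols <- grep(paste0("^", prefix, "\\.v[0-9]+$"), names(data), value = TRUE)
  for (col in setdiff(visit_cols, base_col)) {
    # Fractional change relative to the baseline visit.
    data[[paste0(col, "_change")]] <- (data[[col]] - data[[base_col]]) / data[[base_col]]
  }
}
data
```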
I have a fairly straightforward need, but I can't find a previously asked question that is similar enough. I've been trying with dplyr, but can't figure it out.
julian year
088 22
049 19
041 22
105 18
125 22
245 20
What I want is for each value where data$julian < 105, subtract '1' from data$year, so that
julian year
088 21
049 18
041 21
105 18
125 22
245 20
The OP asked about using dplyr in the post. Here is one option with dplyr
library(dplyr)
df1 <- df1 %>%
mutate(year = case_when(as.numeric(julian) < 105 ~ year -1,
TRUE ~ as.numeric(year)))
-output
df1
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
data
df1 <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
Another option with base R:
df$year[as.numeric(df$julian) < 105] <- df$year[as.numeric(df$julian) < 105] - 1
Output
julian year
1 088 21
2 049 18
3 041 21
4 105 18
5 125 22
6 245 20
Data
df <- structure(list(julian = c("088", "049", "041", "105", "125",
"245"), year = c(22L, 19L, 22L, 18L, 22L, 20L)), row.names = c(NA,
-6L), class = "data.frame")
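A compact variation, sketched here on the question's data, relies on R treating TRUE as 1 and FALSE as 0 in arithmetic. The name early is made up for the illustration, and as.numeric() is used on the assumption that julian is stored as zero-padded text.

```r
# Example data from the question; julian is stored as zero-padded text.
df1 <- data.frame(
  julian = c("088", "049", "041", "105", "125", "245"),
  year = c(22L, 19L, 22L, 18L, 22L, 20L),
  stringsAsFactors = FALSE
)

# TRUE for rows whose julian day is below 105.
early <- as.numeric(df1$julian) < 105

# Subtracting a logical subtracts 1 where TRUE and 0 where FALSE.
df1$year <- df1$year - early
df1
```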
This is my dataframe x
ID Name Initials AGE
123 Mike NA 18
124 John NA 20
125 Lily NA 21
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
129 Oscar NA 32
I also have a numeric vector y, num[1:3], holding the IDs I want to remove from data frame x:
>print(y)
[1] 124 125 129
My goal is to remove all the IDs in y from data frame x
This is my desired output
ID Name Initials AGE
123 Mike NA 18
126 Jasper NA 24
127 Toby NA 27
128 Will NA 19
I'm using the dplyr package and trying this, but it's not working:
FinalData <- x %>%
select(everything()) %>%
filter(ID != c(y))
Can anyone tell me what needs to be corrected?
Because 'y' has more than one element, use %in% and negate it with !. The select step is not needed, as select(everything()) just keeps all the columns
library(dplyr)
x %>%
filter(!ID %in% y)
# ID Name Initials AGE
#1 123 Mike NA 18
#2 126 Jasper NA 24
#3 127 Toby NA 27
#4 128 Will NA 19
Or another option is anti_join
x %>%
anti_join(tibble(ID = y))
In base R, subset can be used
subset(x, !ID %in% y)
data
y <- c(124, 125, 129)
x <- structure(list(ID = 123:129, Name = c("Mike", "John", "Lily",
"Jasper", "Toby", "Will", "Oscar"), Initials = c(NA, NA, NA,
NA, NA, NA, NA), AGE = c(18L, 20L, 21L, 24L, 27L, 19L, 32L)),
class = "data.frame", row.names = c(NA,
-7L))
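To see why the original ID != c(y) filter keeps every row, here is a small base R sketch using the IDs above. The names recycled and kept are made up for the illustration.

```r
ID <- 123:129
y <- c(124, 125, 129)

# != compares element by element and recycles y, so this tests
# 123 != 124, 124 != 125, 125 != 129, 126 != 124, and so on.
# R also warns here because 7 is not a multiple of 3.
recycled <- suppressWarnings(ID != y)
recycled

# %in% asks, for each ID, whether it appears anywhere in y.
kept <- ID[!ID %in% y]
kept
```

With these values every element-wise comparison happens to be TRUE, which is why nothing is filtered out.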
I have original temperature data in table1.txt with station number header which reads as
Date 101 102 103
1/1/2001 25 24 23
1/2/2001 23 20 15
1/3/2001 22 21 17
1/4/2001 21 27 18
1/5/2001 22 30 19
I have a lookup table file lookup.txt which reads as :
ID Station
1 101
2 103
3 102
4 101
5 102
Now, I want to create a new table (new.txt) with ID number header which should read as
Date 1 2 3 4 5
1/1/2001 25 23 24 25 24
1/2/2001 23 15 20 23 20
1/3/2001 22 17 21 22 21
1/4/2001 21 18 27 21 27
1/5/2001 22 19 30 22 30
Is there any way I can do this in R or MATLAB?
I came up with a solution using tidyverse. It involves some wide to long transformation, matching the data frames on Station, and then spreading the variables.
#Recreating the data
library(tidyverse)
df1 <- read_table("table1.txt")
lookup <- read_table("lookup.txt")
#Create the output
k1 <- df1 %>%
gather(Station, value, -Date) %>%
mutate(Station = as.numeric(Station)) %>%
inner_join(lookup) %>% select(-Station) %>%
spread(ID, value)
k1
We can use base R to do this. Create a column index by matching the 'Station' column with the names of the first dataset, use that to duplicate the columns of 'df1' and then change the column names with the 'ID' column of second dataset
i1 <- with(df2, match(Station, names(df1)[-1]))
dfN <- df1[c(1, i1 + 1)]
names(dfN)[-1] <- df2$ID
dfN
# Date 1 2 3 4 5
#1 1/1/2001 25 23 24 25 24
#2 1/2/2001 23 15 20 23 20
#3 1/3/2001 22 17 21 22 21
#4 1/4/2001 21 18 27 21 27
#5 1/5/2001 22 19 30 22 30
data
df1 <- structure(list(Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001",
"1/5/2001"), `101` = c(25L, 23L, 22L, 21L, 22L), `102` = c(24L,
20L, 21L, 27L, 30L), `103` = c(23L, 15L, 17L, 18L, 19L)),
class = "data.frame", row.names = c(NA,
-5L))
df2 <- structure(list(ID = 1:5, Station = c(101L, 103L, 102L, 101L,
102L)), class = "data.frame", row.names = c(NA, -5L))
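A slightly shorter variation on the base R approach, sketched on the same df1 and df2, selects the station columns by name instead of by matched position. It assumes every Station value in the lookup exists as a column name in df1.

```r
df1 <- data.frame(
  Date = c("1/1/2001", "1/2/2001", "1/3/2001", "1/4/2001", "1/5/2001"),
  `101` = c(25L, 23L, 22L, 21L, 22L),
  `102` = c(24L, 20L, 21L, 27L, 30L),
  `103` = c(23L, 15L, 17L, 18L, 19L),
  check.names = FALSE, stringsAsFactors = FALSE
)
df2 <- data.frame(ID = 1:5, Station = c(101L, 103L, 102L, 101L, 102L))

# Pick Date plus one station column per lookup row; repeated stations are
# simply selected again.
dfN <- df1[c("Date", as.character(df2$Station))]

# Relabel the station columns with the corresponding ID.
names(dfN)[-1] <- df2$ID
dfN
```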
Here is an option with MATLAB:
T = readtable('table1.txt','FileType','text','ReadVariableNames',1);
L = readtable('lookup.txt','FileType','text','ReadVariableNames',1);
old_header = strcat('x',num2str(L.Station));
newT = array2table(zeros(height(T),height(L)+1),...
'VariableNames',[{'Date'} strcat('x',num2cell(num2str(L.ID)).')]);
newT.Date = T.Date;
for k = 1:size(old_header,1)
newT{:,k+1} = T.(old_header(k,:));
end
writetable(newT,'new.txt','Delimiter',' ')
I have a dataframe df with the following data:
ID P1 P2 Year Month A B
11084 23 43 2001 April 41.9 -99.99
67985 76 12 2001 May 6.9 -9.99
11084 34 64 2001 June -999 -99.99
34084 56 77 2001 July NA -99.99
11043 90 54 2001 August NA -99.99
23084 55 32 2001 September 50.8 -99.99
11084 77 14 2001 October 0 -99.99
54328 89 56 2001 November -999 -99.99
I'm trying to add two new columns and fill 'Yes'/'No' values for the records with missing values. My expected output is:
ID P1 P2 Year Month A B A_miss B_miss
11084 23 43 2001 April 41.9 -99.99 No Yes
67985 76 12 2001 May 6.9 123 No No
11084 34 64 2001 June -999 -99.99 Yes Yes
34084 56 77 2001 July NA -99.99 Yes Yes
11043 90 54 2001 August NA -99.99 Yes Yes
23084 55 32 2001 September 50.8 -99.99 No Yes
11084 77 14 2001 October 0 -99.99 No Yes
54328 89 56 2001 November -999 -99.99 Yes Yes
I'm new to R. I was trying to achieve this using simple for loop and if/else conditions in the following way:
for(i in length(df$A))
{
if(df$A[i] == -999 || df$A[i] == 'NA')
df$A_miss[i] <- 'Yes'
else
df$A_miss[i] <- 'No'
}
I first tried the loop on the 'A' column, but only the else part executes every time I try, and 'No' is filled in for the entire 'A_miss' column. I can't work out why the if part isn't working.
Where am I going wrong?
Your loop is not correctly defined. This one works:
for (i in 1:length(df$A)) {
if(df$A[i] == -999 || is.na(df$A[i]) )
df$A_miss[i] <- 'Yes'
else
df$A_miss[i] <- 'No'
}
The loop range should be set as (i in 1:length(df$A)), and not as (i in length(df$A)), which yields only a single value of i. Hope this helps.
PS: As you can see, the important correction pointed out by @Pascal (testing with is.na() instead of comparing with 'NA') has been implemented here.
PPS: The version below should be much faster than your code with the for loop:
df$A_miss <- 'No'
df$A_miss[which(df$A==-999 | is.na(df$A))] <- 'Yes'
(I just noticed that this solution is very similar to the one that had been suggested earlier by @Daniel Fischer)
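The difference between the two loop ranges can be checked on a plain vector. The sketch below uses the A values from the question and seq_along(), a base R alternative to 1:length() that also behaves sensibly for empty vectors; the names A and A_miss here are standalone vectors, not columns of df.

```r
A <- c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999)

length(A)      # a single number, so `for (i in length(A))` runs only once
seq_along(A)   # 1 2 3 4 5 6 7 8, one iteration per element

A_miss <- character(length(A))
for (i in seq_along(A)) {
  # Test is.na() first so that `||` never has to evaluate NA == -999.
  A_miss[i] <- if (is.na(A[i]) || A[i] == -999) "Yes" else "No"
}
A_miss
```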
A vectorized version:
df <- structure(list(ID = c(11084L, 67985L, 11084L, 34084L, 11043L,
23084L, 11084L, 54328L), P1 = c(23L, 76L, 34L, 56L, 90L, 55L,
77L, 89L), P2 = c(43L, 12L, 64L, 77L, 54L, 32L, 14L, 56L), Year = c(2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L), Month = structure(c(1L,
5L, 4L, 3L, 2L, 8L, 7L, 6L), .Label = c("April", "August", "July",
"June", "May", "November", "October", "September"), class = "factor"),
A = c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999), B = c(-99.99,
123, -99.99, -99.99, -99.99, -99.99, -99.99, -99.99), A_miss = c("No",
"No", "Yes", "Yes", "Yes", "No", "No", "Yes")), .Names = c("ID",
"P1", "P2", "Year", "Month", "A", "B", "A_miss"), row.names = c(NA,
-8L), class = "data.frame")
df$A_miss <- ifelse(df$A == -999 | is.na(df$A), "yes", "no")
df$B_miss <- ifelse(df$B == -99.99 | is.na(df$B), "yes", "no")
ID P1 P2 Year Month A B A_miss B_miss
1 11084 23 43 2001 April 41.9 -99.99 no yes
2 67985 76 12 2001 May 6.9 123.00 no no
3 11084 34 64 2001 June -999.0 -99.99 yes yes
4 34084 56 77 2001 July NA -99.99 yes yes
5 11043 90 54 2001 August NA -99.99 yes yes
6 23084 55 32 2001 September 50.8 -99.99 no yes
7 11084 77 14 2001 October 0.0 -99.99 no yes
8 54328 89 56 2001 November -999.0 -99.99 yes yes
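If more columns need the same treatment, the two ifelse lines can be folded into a loop over a named vector of sentinel values. This is only a sketch: the sentinels vector is a hypothetical helper, and the data is a trimmed copy of the A and B columns above.

```r
df <- data.frame(
  A = c(41.9, 6.9, -999, NA, NA, 50.8, 0, -999),
  B = c(-99.99, 123, -99.99, -99.99, -99.99, -99.99, -99.99, -99.99)
)

# Hypothetical lookup: the value that marks "missing" in each column.
sentinels <- c(A = -999, B = -99.99)

for (col in names(sentinels)) {
  # A value counts as missing if it is NA or equals the column's sentinel.
  is_missing <- is.na(df[[col]]) | df[[col]] == sentinels[[col]]
  df[[paste0(col, "_miss")]] <- ifelse(is_missing, "Yes", "No")
}
df
```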
Maybe you could try this, without any loop or if clause:
df$A_miss <- "no"
df$A_miss[(df$A==-999)|(is.na(df$A))] <- "yes"
Using the which command might increase the speed of the process:
df$A_miss <- 'No'
df$A_miss[which(df$A==-999 | is.na(df$A))] <- 'Yes'