I have spent a lot of time trying to write a loop to replace NAs with zeros for certain columns in a data frame and have not yet succeeded. I have searched and can't find a similar question.
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
B = c(NA, 10, 12, 14, NA, 16),
C = c(20, NA, 22, 24, 26, NA),
D = c(30, NA, NA, 32, 34, 36))
df
Gives me:
A B C D
1 2 NA 20 30
2 4 10 NA NA
3 6 12 22 NA
4 NA 14 24 32
5 8 NA 26 34
6 10 16 NA 36
I want to set NAs to 0 for only columns B and D. Using separate code lines, I could:
df$B[is.na(df$B)] <- 0
df$D[is.na(df$D)] <- 0
However, I want to use a loop because I have many variables in my real data set.
I cannot find a way to loop over only columns B and D so I get:
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Essentially, I want to apply a loop using a variable list to a data frame:
varlist <- c("B", "D")
How can I loop over only certain columns in the data frame using a variable list to replace NAs with zeros?
Here is a tidyverse approach:
library(tidyverse)
df %>%
  # funs() is deprecated in current dplyr; across() with a formula does the same job
  mutate(across(c(B, D), ~ ifelse(is.na(.), 0, .)))
#output:
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
Basically, this says that columns B and D should be changed by the given function, where . stands for the column currently being processed.
Here's a base R one-liner
df[, varlist][is.na(df[, varlist])] <- 0
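To see why this one-liner works, a small sketch may help. It reuses the question's df and varlist; the name na_cells is made up for illustration. is.na() applied to a data frame returns a logical matrix, and assigning through that matrix replaces only the TRUE cells.

```r
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))
varlist <- c("B", "D")

# is.na() on a data frame gives a logical matrix, one cell per value
na_cells <- is.na(df[varlist])

# Assigning through the logical matrix touches only the TRUE cells
df[varlist][na_cells] <- 0
df
```

Writing df[varlist] rather than df[, varlist] keeps the selection a data frame even when varlist holds a single name.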
Using the zoo package, we can fill the selected columns:
library(zoo)
df[varlist] <- na.fill(df[varlist], 0)
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
In base R, we can use lapply:
df[varlist] <- lapply(df[varlist], function(x) { x[is.na(x)] <- 0; x })
df
A B C D
1 2 0 20 30
2 4 10 NA 0
3 6 12 22 0
4 NA 14 24 32
5 8 0 26 34
6 10 16 NA 36
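For completeness, the explicit loop the question asks about could be sketched as follows. This is only one possible way to write it; it assumes every name in varlist is a column of df.

```r
df <- data.frame(A = c(2, 4, 6, NA, 8, 10),
                 B = c(NA, 10, 12, 14, NA, 16),
                 C = c(20, NA, 22, 24, 26, NA),
                 D = c(30, NA, NA, 32, 34, 36))
varlist <- c("B", "D")

# Loop over the column names; [[ ]] accepts a name stored in a variable,
# which df$v would not
for (v in varlist) {
  df[[v]][is.na(df[[v]])] <- 0
}
df
```

The key point is using df[[v]] instead of df$v, since $ does not evaluate v as a variable.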
Related
data <- structure(list(
x = c(1, 2, 1, 2, 2, 1, 3, 3, 1),
y = c(20, 30, 40, 10, 15, 34, 57, 72, 12)),
class = "data.frame",
row.names = c(NA,-9L))
Hi guys, I want to create a new variable from the above data.frame in RStudio, but it doesn't work. What I want is the equivalent of this Stata command, but in R:
gen var = y*3600 if x == 1
So I ran this R command, but it didn't work:
df$var[df$x == 1] <- df$y*3600
the new variable should look like this:
x y var
1 20 72000
2 30 NA
1 40 144000
2 10 NA
2 15 NA
1 34 122400
3 57 NA
3 72 NA
1 12 43200
I appreciate any help and thanks in advance
data$var <- ifelse(data$x == 1, data$y * 3600, NA)
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
We can use replace, as below:
> transform(
+ data,
+ var = replace(y * 3600, x != 1, NA)
+ )
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
Another option
df$var <- df$y * 3600
df$var[df$x != 1] <- NA
df
#-------
> df
x y var
1 1 20 72000
2 2 30 NA
3 1 40 144000
4 2 10 NA
5 2 15 NA
6 1 34 122400
7 3 57 NA
8 3 72 NA
9 1 12 43200
In data.table:
library(data.table)
data <- as.data.table(data)  # as.data.table() returns a copy, so assign the result
data[x == 1, var := y * 3600]
You need to subset both sides of the assignment, so that the right-hand side has as many values as there are rows being filled:
data$var <- NA
data$var[data$x == 1] <- data$y[data$x == 1] *3600
data
# x y var
#1 1 20 72000
#2 2 30 NA
#3 1 40 144000
#4 2 10 NA
#5 2 15 NA
#6 1 34 122400
#7 3 57 NA
#8 3 72 NA
#9 1 12 43200
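A short sketch of why the attempt in the question misbehaves: the left-hand side selects only the rows where x == 1, while the right-hand side still holds one value per row. The name sel is introduced here purely for illustration.

```r
data <- data.frame(x = c(1, 2, 1, 2, 2, 1, 3, 3, 1),
                   y = c(20, 30, 40, 10, 15, 34, 57, 72, 12))

sel <- data$x == 1          # logical selector for the rows to fill

n_rhs <- length(data$y * 3600)  # one value per row of data
n_lhs <- sum(sel)               # only the rows with x == 1

# Subsetting both sides makes the lengths agree
data$var <- NA
data$var[sel] <- data$y[sel] * 3600
data
```

Here n_rhs and n_lhs differ, which is the mismatch the original one-sided subset runs into.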
Another option is to use case_when in dplyr.
library(dplyr)
data <- data %>% mutate(var = case_when(x == 1 ~ y * 3600))
By default if a condition is not satisfied it returns NA.
I am trying to merge together 8 dataframes into one, matching against the row names.
Examples of the dataframes:
DF1
Plant      Arable and Horticulture
Acer       100
Achillea   90
Aesculus   23
Alliaria   3
Allium     56
Anchusa    299
DF2
Plant      Improved Grassland
Acer       12
Alliaria   3
Allium     50
Brassica   23
Calystegia 299
Campanula  29
And so on for a few hundred rows for different plants and 8 columns of different habitats.
What I want the merged frame to look like:
Plant      Arable and Horticulture  Improved Grassland
Acer       100                      12
Achillea   90                       0
Aesculus   23                       0
Alliaria   3                        3
Allium     56                       50
Anchusa    299                      0
Brassica   0                        23
Calystegia 0                        299
Campanula  0                        29
I tried merging
PolPerGen <- merge(DF1, DF2, all=TRUE)
But that does not match on the row names; it drops them entirely from the output:
    Arable and Horticulture  Improved Grassland
1   100                      NA
2   90                       NA
3   23                       NA
4   2                        NA
5   56                       NA
6   299                      NA
7   NA                       12
8   NA                       3
9   NA                       50
10  NA                       23
11  NA                       299
12  NA                       29
I am completely out of ideas, any thoughts?
Your dataset is:
dat1 = data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
                  row.names = c("Acer", "Achillea", "Aesculus", "Alliaria", "Allium", "Anchusa"))
dat2 = data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
                  row.names = c("Acer", "Alliaria", "Allium", "Brassica", "Calystegia", "Campanula"))
As @Vinícius Félix suggested, first convert the row names to a column.
library(tibble)
dat1 = rownames_to_column(dat1, "Plants")
dat2 = rownames_to_column(dat2, "Plants")
Then let's join both datasets on that column,
library(dplyr)
dat = full_join(dat1, dat2, by = "Plants")
And replace the NA with 0
dat = dat %>% replace(is.na(.), 0)
Plants Arable.and.Horticulture Improved.Grassland
1 Acer 100 12
2 Achillea 90 0
3 Aesculus 23 0
4 Alliaria 3 3
5 Allium 56 50
6 Anchusa 299 0
7 Brassica 0 23
8 Calystegia 0 299
9 Campanula 0 29
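As a base R alternative, merge() can match on row names directly via by = "row.names", and Reduce() can chain it over a whole list, which may suit the eight data frames mentioned in the question. The sketch below uses the two example frames from the question; the helper name merge_by_rownames is made up here and is not part of any package.

```r
dat1 <- data.frame("Arable and Horticulture" = c(100, 90, 23, 3, 56, 299),
                   row.names = c("Acer", "Achillea", "Aesculus",
                                 "Alliaria", "Allium", "Anchusa"))
dat2 <- data.frame("Improved Grassland" = c(12, 3, 50, 23, 299, 29),
                   row.names = c("Acer", "Alliaria", "Allium",
                                 "Brassica", "Calystegia", "Campanula"))

# Hypothetical helper: full outer merge on row names, keeping them as row names
merge_by_rownames <- function(x, y) {
  m <- merge(x, y, by = "row.names", all = TRUE)
  rownames(m) <- as.character(m$Row.names)  # merge() stores the key in Row.names
  m$Row.names <- NULL
  m
}

# Reduce() applies the helper pairwise, so the list can hold any number of frames
merged <- Reduce(merge_by_rownames, list(dat1, dat2))
merged[is.na(merged)] <- 0  # plants missing from a habitat count as 0
merged
```

With more habitats, the list passed to Reduce() would simply contain all eight data frames.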
I am looking to separate rows of data by Cue and add a row after each group that holds the averages per subject. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like:
Reduce(rbind,by(data,data[1],function(x)rbind(x,c(NA,NA,colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
> data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by and summarise from the dplyr package to get a data frame with the average values, then rbind the new data frame with the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages)
df_all <- df_all[order(df_all$Cue), ]  # ties keep their order, so each average follows its group
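The same "compute group means, append, then sort" idea can also be sketched in base R with aggregate(). The names avg and both are illustrative only, and the data are the example from the question.

```r
DF <- read.table(text = "Cue ITI a b c
0 16 0.82062 0.52185 0.27679
0 24 0.53894 0.49957 0.35767
4 22 0.26855 0.17487 0.22461
4 20 0.15106 0.48767 0.49072
7 18 0.11627 0.12604 0.2832
7 24 0.50201 0.14252 0.21454
12 16 0.27649 0.96008 0.42114
12 18 0.60852 0.21637 0.18799
22 20 0.32867 0.65308 0.29388
22 24 0.25726 0.37048 0.32379", header = TRUE)

# Mean of a, b and c within each Cue
avg <- aggregate(DF[c("a", "b", "c")], by = list(Cue = DF$Cue), FUN = mean)
avg$ITI <- NA

# Append the averages, then sort; order() keeps ties in their original order,
# so each average row lands directly after the rows of its Cue
both <- rbind(DF, avg[names(DF)])
both <- both[order(both$Cue), ]
both$Cue[is.na(both$ITI)] <- NA  # blank out Cue on the average rows
rownames(both) <- NULL
both
```

Dropping the line that sets Cue to NA leaves the group label visible on the average rows.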
I've spent the better part of a day on this but I keep getting stuck. This wouldn't take me very long using index-match-match in Excel, but I'm newer to R and merging data doesn't seem very straight-forward. I've searched the site and found similar problems but no solutions specific to this type of issue.
I have two data frames. They have different lengths in both dimensions. a is 4x4 and b is 3x3. They partially overlap:
a <- data.frame("ID" = c(1:4), "A" = c(21:24), "B" = c(31:34), "C" = c(41:44))
a
ID A B C
1 1 21 31 41
2 2 22 32 42
3 3 23 33 43
4 4 24 34 44
and
b <- data.frame("ID" = c(4:6), "C" = c(22:24), "D" = c(32:34))
b
ID C D
1 4 22 32
2 5 23 33
3 6 24 34
I'm merging on "ID" number. My goal is to get them to look like
c <- data.frame("ID" = c(1:6), "A" = c(21:24, NA, NA), "B" = c(31:34, NA, NA), "C" = c(41:43,22:24), "D" = c(NA, NA, NA, 32:34))
c
ID A B C D
1 21 31 41 NA
2 22 32 42 NA
3 23 33 43 NA
4 24 34 22 32
5 NA NA 23 33
6 NA NA 24 34
As you can see, the final data frame combines the two and assigns NA to the missing information. In column "C", I would like b to overwrite a where it has numerical values. In this example, the value in c[4,3] should change from 44 to 22.
Most of this is simple enough. But getting column "C" correct has been a nightmare. I did the simple thing first:
merge(a, b, by = "ID", all = T)
It almost does the trick but ends up with duplicate "C" columns:
ID A B C.x C.y D
1 1 21 31 41 NA NA
2 2 22 32 42 NA NA
3 3 23 33 43 NA NA
4 4 24 34 44 22 32
5 5 NA NA NA 23 33
6 6 NA NA NA 24 34
This wouldn't be so bad if I could find out how to merge the duplicate columns correctly, because then I could just run
merge(a[-4], b[-2], by = "ID", all = T)
ID A B D
1 1 21 31 NA
2 2 22 32 NA
3 3 23 33 NA
4 4 24 34 32
5 5 NA NA 33
6 6 NA NA 34
to merge everything else, then bring in the merged "C" after the fact.
But I can't figure out how to deal with this part of it:
merge(a[c(1,4)], b[c(1,2)], by = "ID", all = T)
ID C.x C.y ID C
1 1 41 NA 1 1 41
2 2 42 NA 2 2 42
3 3 43 NA -> 3 3 43
4 4 44 22 4 4 22
5 5 NA 23 5 5 23
6 6 NA 24 6 6 24
There's gotta be a way.
Thanks for your help!
For anyone else looking at this in the future, I realized this could also be solved using the following in base rather than dplyr:
df <- merge(a, b, by = "ID", all = T)
df[,"C"] <- ifelse(is.na(df[,"C.y"]), df[,"C.x"], df[,"C.y"])
df <- df[,-c(match("C.x", names(df)),match("C.y", names(df)))]
This ended up being the method I used because down the road I came to needing to perform some steps that were very difficult with dplyr for a novice (using variables inside mutate() and select()) and much more straightforward in base using the above syntax.
Thanks again to CPak, without whom I could not have figured this out.
Try this
library(dplyr)
starthere <- merge(a, b, by = "ID", all = T)
starthere %>%
mutate(C = ifelse(is.na(C.y), C.x, C.y)) %>%
select(-C.x, -C.y)
# ID A B D C
# 1 1 21 31 NA 41
# 2 2 22 32 NA 42
# 3 3 23 33 NA 43
# 4 4 24 34 32 22
# 5 5 NA NA 33 23
# 6 6 NA NA 34 24
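If more than one column overlaps between the two data frames, the same "prefer b, fall back to a" rule can be applied in a loop. This is a rough base R sketch under the assumption that merge() uses its default .x and .y suffixes; the names m and shared are made up for illustration.

```r
a <- data.frame(ID = 1:4, A = 21:24, B = 31:34, C = 41:44)
b <- data.frame(ID = 4:6, C = 22:24, D = 32:34)

m <- merge(a, b, by = "ID", all = TRUE)

# Columns present in both inputs (other than the key) get .x/.y suffixes
shared <- setdiff(intersect(names(a), names(b)), "ID")

for (col in shared) {
  from_a <- m[[paste0(col, ".x")]]
  from_b <- m[[paste0(col, ".y")]]
  # Take b's value where it exists, otherwise keep a's
  m[[col]] <- ifelse(is.na(from_b), from_a, from_b)
  m[[paste0(col, ".x")]] <- NULL
  m[[paste0(col, ".y")]] <- NULL
}
m
```

With the example data only "C" is shared, so the result matches the answers above, with C as the last column.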
My dataframe needs to be expanded
df1<-structure(list(TotalTime = c(0, 15, 16, 23, 24, 29), PhaseName = structure(c(1L,1L, 2L, 2L, 2L, 3L), .Label = c("A", "B","C"), class = "factor")), .Names = c("TotalTime", "Phase"), row.names = c(NA, 6L), class = "data.frame")
df1:
TotalTime Phase
1 0 A
2 15 A
3 16 B
4 23 B
5 24 B
6 29 C
It should become the following data frame, with rows duplicated based on TotalTime so that TotalTime is filled in for every number (second). (I put .. in the example to save space, but it should be filled with 6, 7, 8, 9-15, etc.):
TotalTime Phase
1 0 A
2 1 A
3 2 A
4 3 A
5 4 A
6 5 A
..
16 15 A
17 16 B
18 17 B
.. B
24 23 B
25 24 B
26 25 B
27 26 B
28 27 B
29 28 B
30 29 C
Using both the zoo and dplyr packages:
library(dplyr)
library(zoo)
data.frame(TotalTime=0:max(df1$TotalTime)) %>% left_join(df1) %>% na.locf
It first creates a data.frame that has the whole sequence from 0 to 29 (here) and merges it with your data. Then I simply do a "last observation carried forward" imputation on the missing values created by the merge.
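The two steps described here (build the full sequence and merge, then carry the last observation forward) can also be sketched without extra packages. This is an illustrative base R version, not what na.locf does internally; it assumes the first row after the merge is not NA, which holds because TotalTime 0 is in the data. The names full and idx are made up.

```r
df1 <- data.frame(TotalTime = c(0, 15, 16, 23, 24, 29),
                  Phase = factor(c("A", "A", "B", "B", "B", "C")))

# Step 1: every second from 0 to the maximum, left-joined to the data
full <- data.frame(TotalTime = 0:max(df1$TotalTime))
full <- merge(full, df1, by = "TotalTime", all.x = TRUE)

# Step 2: for each row, the index of the most recent non-NA Phase
idx <- cummax(seq_along(full$Phase) * (!is.na(full$Phase)))
full$Phase <- full$Phase[idx]
full
```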
It can also be done with the data.table package, using a rolling join (adapted from another answer):
library(data.table)
df1 = data.table(df1, key="TotalTime")
df2=data.table(TotalTime=0:max(df1$TotalTime))
df1[df2, roll=T]
You can get it done with dplyr and tidyr:
library(tidyverse)
df1 %>% do(data.frame(TotalTime = first(.$TotalTime):last(.$TotalTime))) %>%
left_join(df1, by="TotalTime") %>%
fill(Phase)
Output:
TotalTime Phase
0 A
1 A
2 A
3 A
4 A
5 A
6 A
7 A
8 A
9 A
10 A
11 A
12 A
13 A
14 A
15 A
16 B
17 B
18 B
19 B
20 B
21 B
22 B
23 B
24 B
25 B
26 B
27 B
28 B
29 C
I hope this helps.
In case you want to see a base R solution.
phases <- with(aggregate(TotalTime~Phase, df1, FUN=min),
rep(Phase, c(diff(TotalTime),
max(df1$TotalTime[df1$Phase == tail(Phase, 1)]) -
min(df1$TotalTime[df1$Phase == tail(Phase, 1)])+1)))
The main "trick" here is that the second argument of rep can be a vector, which then repeats each element of the first argument that many times. The second argument is constructed using the difference of the minimum values of each phase diff(TotalTime) and concatenating the difference of the min and max value (+1) of the final phase level (here, "C"). The minimum values are found with aggregate, and I use with to simplify notation.
The result can then be fed to the data.frame.
data.frame(period=seq_len(length(phases))-1, phase=phases)
period phase
1 0 A
2 1 A
3 2 A
4 3 A
5 4 A
6 5 A
7 6 A
8 7 A
9 8 A
10 9 A
11 10 A
12 11 A
13 12 A
14 13 A
15 14 A
16 15 A
17 16 B
18 17 B
19 18 B
20 19 B
21 20 B
22 21 B
23 22 B
24 23 B
25 24 B
26 25 B
27 26 B
28 27 B
29 28 B
30 29 C
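The vectorised second argument of rep described above can be seen in isolation in this small sketch. The phase start times are typed in by hand here rather than computed with aggregate, and the names starts, last_time and times are illustrative.

```r
# First second of each phase, as found in the example data
starts <- c(A = 0, B = 16, C = 29)
last_time <- 29  # final TotalTime in the data

# Seconds covered by each phase: gaps between starts, plus the last phase's span
times <- c(diff(starts), last_time - starts[["C"]] + 1)

# rep() repeats each phase label by the matching count
phases <- rep(names(starts), times)
data.frame(period = seq_along(phases) - 1, phase = phases)
```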