My apologies if this is a basic question. I'm new to R.
I have a dataset, DAT, which has 3 variables: ID, V1 and V2. Unfortunately, V2 data are missing for many cases. I want to create a new variable, V3. I want V3 to have the same values as V2, but for any case that has a missing value for V2, I want V3 to take the value of V1 instead. What is the most efficient way to do this in R?
Here is one approach using the dplyr package.
# Step 1: Load verb-like data wrangling package.
library(dplyr)
# Step 2: Create some data.
df <- data.frame(ID=1:5, V1 = 11:15, V2 = c(31:33, NA, NA))
ID V1 V2
1 11 31
2 12 32
3 13 33
4 14 NA
5 15 NA
# Step 3: Create a variable V3 using your criteria
df <- mutate(df, V3 = if_else(is.na(V2), V1, V2))
ID V1 V2 V3
1 11 31 31
2 12 32 32
3 13 33 33
4 14 NA 14
5 15 NA 15
Using the data.table package would probably be more efficient if you have a big data frame.
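For example, a minimal data.table sketch of the same fill (assuming your data frame is DAT, as in the question):
library(data.table)
setDT(DAT)                               # convert to a data.table by reference
DAT[, V3 := fifelse(is.na(V2), V1, V2)]  # fast, type-stable version of ifelse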
You can also use the ifelse() function.
DAT$V3 <- ifelse(is.na(DAT$V2), DAT$V1, DAT$V2)
This reads as: if V2 is missing, use V1; otherwise use the value in V2.
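If you are already using dplyr, coalesce() expresses the same fallback in one line; a minimal sketch:
# take V2 where present, otherwise fall back to V1
DAT$V3 <- dplyr::coalesce(DAT$V2, DAT$V1)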
Related
How can I filter in R only those rows in which the value in column V6 appears exactly 2 times?
My dataset is in an object called date.
I tried:
library(dplyr)
df <- as.data.frame(date)
df1 <- subset(df,duplicated(V6))
but it does not work.
You can use a contingency table to get the value counts. Here's some example code.
# Make some dummy data (only 8 and 2 appear exactly twice in this example)
df <- data.frame(V1 = 1:10,
                 V2 = 11:10,  # 11:10 is c(11, 10), recycled down the 10 rows
                 V6 = c(1, 2, 8, 3, 4, 3, 2, 3, 8, 7))
# Get table of counts for column "V6"
tab <- table(df$V6)
# Get values that appear exactly twice
twice <- as.numeric(names(tab)[tab == 2])
# Filter the data frame based on these values
df <- df[df$V6 %in% twice,]
Output:
V1 V2 V6
2 2 10 2
3 3 11 8
7 7 11 2
9 9 11 8
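If you prefer dplyr, the same filter can be written as a grouped count; a sketch reusing the df built above:
library(dplyr)
# keep only the rows whose V6 value occurs exactly twice
df %>% group_by(V6) %>% filter(n() == 2) %>% ungroup()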
I have a data frame like this
V1 V2
10 5
20 4
30 8
40 6
10 10
20 7
30 4
40 9
And I would like to have all the values relating to the same V1 in one row, like so...
V1 V2 V3
10 5 10
20 4 7
30 8 4
40 6 9
Here is a solution in base R. You can feed the unique values of column V1 into lapply and extract all V2 values for each unique V1. Because lapply returns a list, you feed the result into do.call with rbind, and then bind the vector of unique values back on as a column with cbind.
# Create df1 for demonstration
df1 = data.frame(a = rep(1:4, 10), b = sample(1:40))
output = cbind(unique(df1$a),
               do.call(rbind, lapply(unique(df1$a), function(x) df1$b[df1$a == x])))
This solution depends on the values in the source data frame being of the same type. If they are not, you may have to invest some time in casting the data to the correct types first, but that should not be a problem.
You can do what you want with apply functions.
DF <- data.frame(A = c(1:5,1:5),B=11:20)
lst <- lapply(unique(DF$A), function(AA) DF[DF$A == AA, 'B'])
Result <- do.call(rbind,lst)
If you wish to have the A column back in, you can use Result <- cbind(A = unique(DF$A), Result); note that lst is an unnamed list here, so names(lst) would return NULL.
Be careful: this will give you a matrix, not a data.frame. If your values are not numeric like in this example, that may cause some issues.
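If you need a data.frame downstream, the conversion is one call; a sketch reusing Result from above (ResultDF is just an illustrative name):
# bind the grouping values back on and convert to a data.frame
ResultDF <- data.frame(A = unique(DF$A), Result)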
There are some alternative ways to do this using data.table or dplyr.
We can do this with dcast from data.table
library(data.table)
dcast(setDT(df1), V1~paste0("V", rowid(V1)+1))
# V1 V2 V3
#1: 10 5 10
#2: 20 4 7
#3: 30 8 4
#4: 40 6 9
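A dplyr/tidyr version of the same reshape, sketched under the assumption that the data frame is df1 with columns V1 and V2 as in the question:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(V1) %>%
  mutate(name = paste0("V", row_number() + 1)) %>%  # "V2", "V3", ... within each V1
  ungroup() %>%
  pivot_wider(names_from = name, values_from = V2)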
I'm sorry for asking this question; I saw something similar in the past but I couldn't find it (so flagging it as a duplicate would be understandable).
I have 2 data frames, and I want to move every (matching) customer who appears in both data frames into one of them. Please note that I want to move the entire row.
Here is an example:
# df1
customer_ip V1 V2
1 15 20
2 12 18
# df2
customer_ip V1 V2
2 45 50
3 12 18
And I want my new data frames to look like:
# df1
customer_ip V1 V2
1 15 20
2 12 18
2 45 50
# df2
customer_ip V1 V2
3 12 18
Thank you in advance!
This does it.
df1 <- rbind(df1, df2[df2$customer_ip %in% df1$customer_ip, ])
df2 <- df2[!(df2$customer_ip %in% df1$customer_ip), ]
EDIT: Gaurav & Sotos got here before me with essentially the same answer whilst I was writing, but I'll leave this here as it shows the code without the redundant which().
This should do the trick:
#Add appropriate rows to df1
df1 <- rbind(df1, df2[which(df2$customer_ip %in% df1$customer_ip),])
#Remove appropriate rows from df2
df2 <- df2[-which(df2$customer_ip %in% df1$customer_ip),]
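Either way, it is a little safer to compute the match once before modifying either frame, so the two updates cannot interact; a minimal sketch:
# flag the df2 rows whose customer_ip already appears in df1
matched <- df2$customer_ip %in% df1$customer_ip
df1 <- rbind(df1, df2[matched, ])  # move the matching rows into df1
df2 <- df2[!matched, ]             # and drop them from df2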
I got a data frame in R like the following:
V1 V2 V3
1 2 3
1 43 54
2 34 53
3 34 51
3 43 42
...
And I want to delete all rows whose V1 value has a frequency lower than 2. So in my example the row with V1 = 2 should be deleted, because the value "2" appears only once in the column ("1" and "3" each appear twice).
I tried to add an extra column with the frequency of V1 in it, so I could then keep only the rows where the frequency is > 1, but with the following I only get NAs in the extra column.
data$Frequency <- table(data$V1)[data$V1]
Thanks
You can try this:
library(dplyr)
df %>% group_by(V1) %>% filter(n() > 1)
You can also consider using data.table. We first count the occurrence of each value in V1, then filter on those occurrences being more than 1. Finally, we remove our count column as we no longer need it.
library(data.table)
setDT(dat)
dat2 <- dat[, n := .N, by = V1][n > 1][, n := NULL]
Or even quicker, thanks to RichardScriven: compute the row indices of the groups with at least two rows and subset with them (naming the index column avoids a clash with the grouping column V1):
dat2 <- dat[dat[, .(idx = .I[.N >= 2]), by = V1]$idx]
> dat2
V1 V2 V3
1: 1 2 3
2: 1 43 54
3: 3 34 51
4: 3 43 42
With this approach you do not need to load any library:
res <- data.frame(V1 = c(1,1,2,3,3,3), V2 = rnorm(6), V3 = rnorm(6))
res[res$V1 %in% names(table(res$V1))[table(res$V1) >= 2], ]
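As an aside, the NAs in your original attempt come from indexing the table with a numeric vector, which selects by position rather than by name (so any value larger than the number of distinct values yields NA); converting the values to character fixes it. A sketch using the res data above:
# index the count table by name, not by position
res$Frequency <- as.vector(table(res$V1)[as.character(res$V1)])
res[res$Frequency >= 2, ]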
I have a fairly large data frame with one numeric column and a bunch of factors. One of these factors has only two values. I want to generate a new, smaller data frame that divides each value of the numeric variable at one factor level by the matching value at the other level.
Example Data:
set.seed(1)
V1 <- rep(c("a","b"), each =8)
V2 <- 1:4
V3 <- rep(c("High","Low"), each=4)
V4 <- rnorm(16)
foo <- data.frame(V1,V2,V3,V4)
Which gives me the following data frame:
V1 V2 V3 V4
1 a 1 High -0.62645381
2 a 2 High 0.18364332
3 a 3 High -0.83562861
4 a 4 High 1.59528080
5 a 1 Low 0.32950777
6 a 2 Low -0.82046838
7 a 3 Low 0.48742905
8 a 4 Low 0.73832471
9 b 1 High 0.57578135
10 b 2 High -0.30538839
11 b 3 High 1.51178117
12 b 4 High 0.38984324
13 b 1 Low -0.62124058
14 b 2 Low -2.21469989
15 b 3 Low 1.12493092
16 b 4 Low -0.04493361
I want to generate a smaller data frame that divides V4(High) by the matching V4(Low)
V1 V2 V4
1 a 1 -1.901181 #foo[1,4]/foo[5,4]
2 a 2 -0.223827 #foo[2,4]/foo[6,4]
...
The problem is that my real data is messier than this. I do know that V3 repeats regularly (there is a High for every Low), but V2 and V1 do not repeat as regularly as I've demonstrated here. They are not highly irregular, but there are a few dropped values (e.g. b3Low and b3High might have been dropped).
I'm assuming I'm going to have to restructure my data frame somehow, but I have no idea where to even start. Thanks in advance.
Here's an option using dplyr and reshape2:
library(dplyr)
library(reshape2)
foo %>%
  dcast(V1 + V2 ~ V3, value.var = "V4") %>%
  mutate(Ratio = High/Low) %>%
  select(V1, V2, Ratio)
V1 V2 Ratio
1 a 1 -1.9011807
2 a 2 -0.2238274
3 a 3 -1.7143595
4 a 4 2.1606764
5 b 1 -0.9268251
6 b 2 0.1378915
7 b 3 1.3438880
8 b 4 -8.6759832
Get rid of the select statement if you want to keep the High and Low columns in the final result.
Or with dplyr alone:
foo %>%
  group_by(V1, V2) %>%
  summarise(Ratio = V4[V3 == "High"] / V4[V3 == "Low"])
Or with data.table:
library(data.table)
setDT(foo)[ , list(Ratio = V4[V3=="High"]/V4[V3=="Low"]), by=list(V1, V2)]
One way to do it would be to first split the dataframe by V3. Then, if they're ordered correctly, it's straightforward. If not, then merge them into a single dataframe and proceed from there. For example:
# Split foo
fooSplit <- split(foo, foo$V3)
#If ordered correctly (as in the example)
fooSplit[[1]]$V4 / fooSplit[[2]]$V4
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
#If not ordered correctly, merge into new dataframe
#Rename variables in prep for merge
names(fooSplit[[1]])[4] <- "High"
names(fooSplit[[2]])[4] <- "Low"
#Merge into a single dataframe, drop V3
d <- merge(fooSplit[[1]][,-3], fooSplit[[2]][,-3], by = 1:2, all = TRUE)
d$High / d$Low
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
I think the dplyr package could help you with that.
Continuing from your code:
You can create a "key" column to use when cross-referencing the "High" and "Low" values.
foo <- mutate(foo, key = paste(V1, V2))
Now that you have the "key" column, you can use filter to split the data set into two groups ("High" and "Low"), merge to join them by the key column, and select to tidy up the data set and keep only the important columns.
foo <- select(merge(filter(foo,V3=="High"), filter(foo,V3=="Low"),
by="key"), V1.x, V2.x, V4.x, V4.y)
Finally, when you have the data in the same table, you can create a new calculated column using mutate. We use select again to keep the data set as simple as possible.
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
So, if you execute:
foo <- mutate(foo, key = paste(V1, V2))
foo <- select(merge(filter(foo, V3 == "High"), filter(foo, V3 == "Low"),
              by = "key"), V1.x, V2.x, V4.x, V4.y)
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
you will get:
# V1.x V2.x V4
#1 a 1 -1.9011807
#2 a 2 -0.2238274
#3 a 3 -1.7143595
#4 a 4 2.1606764
#5 b 1 -0.9268251
#6 b 2 0.1378915
#7 b 3 1.3438880
#8 b 4 -8.6759832
It's probably not the shortest way to do it, but I hope it helps.