Removing rows of data in R

Removing rows of data in R - r

What's the most reliable way to remove matching Ids from two large data frames in large?
For example, I have a list of participants who do not want to be contacted (n=200). I would like to remove them from my dataset of over 100 variables and 200,000 observations.
This is the list of 200 participants ids that I need to remove from the dataset.
exclude=read.csv("/home/Project/file/excludeids.csv", header=TRUE, sep=",")
dataset.exclusion<- dataset[-which(exclude$ParticipantId %in% dataset$ParticipantId ), ]
Is this the correct command to use?
I don't think this command is doing what I want, because when I verify with the following: length(which(dataset.exclusion$ParticipantId %in% exclusion$ParticipantId))
I don't get 0.
Any insight?

You can do this for example:
sample1[!sample1$ParticipantID %in%
unique(exclusion$ParticipantId),]

Something like this?
library(data.table)
dataset <- data.table(
a = c(1,2,3,4,5,6),
b = c(11,12,13,14,15,16),
d = c(21,22,23,24,25,26)
)
setkeyv(dataset, c('a','b'))
ToExclude <- data.table(
a = c(1,2,3),
b = c(11,12,13)
)
dataset[!ToExclude]
# a b d
# 1: 4 14 24
# 2: 5 15 25
# 3: 6 16 26

Related

Create dataframe from smallest vector available

I want to create a dataframe from a list of dataframes, specifically from a certain column of those dataframes. However each dataframe contains a different number of observations, so the following code gives me an error.
diffs <- data.frame(sensor1 = sensores[[1]]$Diff,
sensor2 = sensores[[2]]$Diff,
sensor3 = sensores[[3]]$Diff,
sensor4 = sensores[[4]]$Diff,
sensor5 = sensores[[5]]$Diff)
The error:
Error in data.frame(sensor1 = sensores[[1]]$Diff, sensor2 = sensores[[2]]$Diff, :
arguments imply differing number of rows: 29, 19, 36, 26
Is there some way to force data.frame() to take the minimal number or rows available from each one of the columns, in this case 19?
Maybe there is a built-in function in R that can do this, any solution is appreciated but I'd love to get something as general and as clear as possible.
Thank you in advance.

I can think of two approaches:
Example data:
df1 <- data.frame(A = 1:3)
df2 <- data.frame(B = 1:4)
df3 <- data.frame(C = 1:5)
Compute the number of rows of the smallest dataframe:
min_rows <- min(sapply(list(df1, df2, df3), nrow))
Use subsetting when combining:
diffs <- data.frame(a = df1[1:min_rows,], b = df2[1:min_rows,], c = df3[1:min_rows,] )
diffs
a b c
1 1 1 1
2 2 2 2
3 3 3 3
Alternatively, use merge:
rowmerge <- function(x,y){
# create row indicators for the merge:
x$ind <- 1:nrow(x)
y$ind <- 1:nrow(y)
out <- merge(x,y, all = T, by = "ind")
out["ind"] <- NULL
return(out)
}
Reduce(rowmerge, list(df1, df2, df3))
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 NA 4 4
5 NA NA 5
To get rid of the rows with NAs, remove the all = T.
For your particular case, you would probably call Reduce(rowmerge, sensores), assuming that sensores is a list of dataframes.
Note: if you already have an index somewhere (e.g. a timestamp of some sort), then it would be advisable to simply merge on that index instead of creating ind.

Find the mean of one variable subseted by another variable

I have a list of identical dataframes. Each data frame contains columns with unique variables (temp/DO) and with repeated variables (eg-t1).
[[1]]
temp DO t1
1 4 1
3 9 1
5 7 1
I want to find the mean of DO when the temperature is equal to t1.
t1 represents a specific temperature, but the value varies for each data frame in the list so I can't specify an actual value.
So far I've tried writing a function
hvod<-function(DO, temp, depth){
hDO<-DO[which(temp==t1[1])]
mHDO<-mean(hDO)
htemp<-temp[which(temp=t1[1])]
mhtemp<-mean(htemp)
}
hfit<-hvod(data$DO, data$temp, data$depth)
But for whatever reason t1 is not recognized. Any ideas on the function OR
a way to combine select (dplyr function) and lapply to solve this?
I've seen similar posts put none that apply to the issue of a specific value (t1) that changes for each data frame.

I would just take the dataframe as argument and do rest of the logic inside function as it gives more control to the function. Something like this would work,
hvod<-function(data){
temp <- data$temp
t1 <- data$t1
DO <- data$DO
hDO<-DO[which(temp==t1[1])]
mHDO<-mean(hDO)
htemp<-temp[which(temp=t1[1])]
mhtemp<-mean(htemp)
}

You can try using dplyr::bind_rows function to combine all data.frames from list in one data.frame.
Then group on data.frame number to find the mean of DO for rows having temp==t1 as:
library(dplyr)
bind_rows(ll, .id = "DF_Name") %>%
group_by(DF_Name) %>%
filter(temp==t1) %>%
summarise(MeanDO = mean(DO)) %>%
as.data.frame()
# DF_Name MeanDO
# 1 1 4.0
# 2 2 6.5
# 3 3 6.7
Data:
df1 <- read.table(text =
"temp DO t1
1 4 1
3 9 1
5 7 1",
header = TRUE)
df2 <- read.table(text =
"temp DO t1
3 4 3
3 9 3
5 7 1",
header = TRUE)
df3 <- read.table(text =
"temp DO t1
2 4 2
2 9 2
2 7 2",
header = TRUE)
ll <- list(df1, df2, df3)

Thank you Thiloshon and MKR for the help! I had initial combined the data I needed into one list of data frames but to answer this I actually had my data in separate data frames (fitsObs and df1).
The variables I was working with in the code were 1 to 1, so by finding the range where depth and d2 were the same (I used temp and t1 in the example), I could find the mean over that range .
for(i in 1:1044){
df1 <- GLNPOsurveyCTD$data[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
deptho <- -abs(df1$depth) #defining temp and depth in the loop
to <- df1$temp
do <- df1$DO
xx <- which(deptho <= fitObs$d2) #mean over range xx
mhtemp <- mean(to[xx], na.rm=TRUE)
mHDO <- mean(do[xx], na.rm=TRUE)
}

Finding patterns across rows of data.table in R

I am trying to find patterns across rows of a data.table while still maintaining the linkages of data across the rows. Here is a reduced example:
Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
I would like to search for all instances of "ATB" in successive rows and output the integers from the value column. Ideally, I want to bin the number of instances as well. The output table would look like this:
String Frequency Value1 Value2 Value 3
ATB 1 500 -200 5000
CAT 1 1000 500 -200
Since the data.table packages seems to be oriented towards providing operations on a column or row-wise basis I thought this should be possible. However, I haven't the slightest idea where to start. Any pointers in the right direction would be greatly appreciated.
Thanks!

library("plyr")
library("stringr")
df <- read.table(header = TRUE, text = "Row ID Value
1 C 1000
2 A 500
3 T -200
4 B 5000
5 T -900
6 A 300
7 C 200
8 A 700
9 T -500")
sought <- c("ATB", "CAT", "NOT")
ids <- paste(df$ID, collapse = "")
ldply(sought, function(id) {
found <- str_locate_all(ids, id)
if (nrow(found[[1]])) {
vals <- outer(found[[1]][,"start"], 0:2, function(x, y) df$Value[x + y])
} else {
vals <- as.list(rep(NA, 3))
}
data.frame(ID = id, Count = str_count(ids, id),
setNames(as.data.frame(vals), paste0("Value", 1:3)))
})
Here's a solution using stringr and plyr. The ids are collapsed into a single string, all instances of each target located and then a data frame constructed with the relevant columns.

How to create single table by extracting certain cells from multiple CSV files

I am wondering if it is possible to create a new dataframe with certain cells from each file from the working directory. for example say If I have 2 data frame like this (please ignore the numbers as they are random):
Say in each dataset, row 4 is the sum of my value and Row 5 is number of missing values. If I represent number of missing values as "M" and Sum of coloumns as "N", what I am trying to acheive is the following table:
So each file 'N' and 'M' are in 1 single row.
I have many files in the directory so I have read them in a list, but not sure what would be the best way to perform such task on a list of files.
this is my sample code for the tables I have shown and how I read them in list:
##Create sample data
df = data.frame(Type = 'wind', v1=c(1,2,3,100,50), v2=c(4,5,6,200,60), v3=c(6,7,8,300,70))
df2 =data.frame(Type = 'test', v1=c(3,2,1,400,40), v2=c(2,3,4,500,30), v3=c(6,7,8,600,20))
# write to directory
write.csv(df, file = "sample1.csv", row.names = F)
write.csv(df2, file = "sample2.csv", row.names = F)
# read to list
mycsv = dir(pattern=".csv")
n <- length(mycsv)
mylist <- vector("list", n)
for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i],header = TRUE)
I would be really greatful if you could give me some suggestion about if this possible and how I should approch?
Many thanks,
Ayan

This should work:
processFile <- function(File) {
d <- read.csv(File, skip = 4, nrows = 2, header = FALSE,
stringsAsFactors = FALSE)
dd <- data.frame(d[1,1], t(unlist(d[-1])))
names(dd) <- c("ID", "v1N", "V1M", "v2N", "V2M", "v3N", "V3M")
return(dd)
}
ll <- lapply(mycsv, processFile)
do.call(rbind, ll)
# ID v1N V1M v2N V2M v3N V3M
# 1 wind 100 50 200 60 300 70
# 2 test 400 40 500 30 600 20
(The one slightly tricky/unusual bit comes in that third line of processFile(). Here's a code snippet that should help you see how it accomplishes what it does.)
(d <- data.frame(a="wind", b=1:2, c=3:4))
# a b c
# 1 wind 1 3
# 2 wind 2 4
t(unlist(d[-1]))
# b1 b2 c1 c2
# [1,] 1 2 3 4

CAVEAT: I'm not sure I fully understand what you want. I think you're reading in a list and want to select certain dataframes from that list with the same rows from that list. Then you want to create a data frame of those rows and go from long to wide format.
LIST <- lapply(2:3, function(i) {
x <- mylist[[i]][4:5, ]
x <- data.frame(x, row = factor(rownames(x)))
return(x)
}
)
DF <- do.call("rbind", LIST) #lets you bind an unknown number of rows from a list
levels(DF$row) <- list(M =4, N = 5) #recodes rows 4 and 5 with M and N
wide <- reshape(DF, v.names=c("v1", "v2", "v3"), idvar=c("Type"),
timevar="row", direction="wide") #reshape from long to wide
rownames(wide) <- 1:nrow(wide) #give proper row names
wide
This yields:
Type v1.M v2.M v3.M v1.N v2.N v3.N
1 wind 100 200 300 50 60 70
2 test 400 500 600 40 30 20

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks

This loops by column, setting dt1 by reference and (hopefully) should be quick.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]])
set(dt1,w[nna],i,dt2[[i]][nna])
}
dt1 = rbind(dt1,dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2

This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #repalce all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading not all columns have the same name but I am assuming the same order. this may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
return(LS[[i]])
}
)
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
dt1[J(k),names(p):=as.list(p),with=FALSE]
}
or, can you use matrix instead of data.frame? If so it could be a single line using A[B] syntax where B is a 2-column matrix containing the row and column numbers to update.

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
setnames(dt2s,"value","value.new")
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex