I have two dataframes, df1 and df2.
df1:
col1 <- c('30','30','30','30')
col2 <- c(3,13,18,41)
col3 <- c("heavy","light","blue","black")
df1 <- data.frame(col1,col2,col3)
>df1
col1 col2 col3
1 30 3 heavy
2 30 13 light
3 30 18 blue
4 30 41 black
df2:
col1 <- c('10',"NONE")
col2 <- c(21,"NONE")
col3 <- c("blue","NONE")
df2 <- data.frame(col1,col2,col3)
>df2
col1 col2 col3
1 10 21 blue
2 NONE NONE NONE
I wrote a bit of script that says: if a value in col3 is equal to "light", remove that row and all subsequent rows in the dataframe. So df1 would look like:
col1 col2 col3
1 30 3 heavy
And there would be no changes to df2 (as it has no matches to "light" in col3).
I have stated there are two separate df's above as two examples, but the script below just refers to a general "df" to save me copying and pasting the same bit of code twice with df1 replaced by df2.
phrase <- "light"
start_rownum <- which(grepl(phrase, df[, 3]))
end_rownum <- nrow(df)
end_rownum <- as.numeric(end_rownum)
if (start_rownum > 0) {
  df <- df[-c(start_rownum:end_rownum), ]
}
This script works fine with df1, as start_rownum has a numerical value. However, I get the following error with df2:
Error in start_rownum:end_rownum : argument of length 0
Instead of saying "if(start_rownum > 0)", is there some way to check if start_rownum has a numerical value? I can't find a working solution.
Thanks.
For anyone who has a similar problem, I just solved it. Use the condition
if (length(start_rownum) > 0 & is.numeric(start_rownum))
The length() check is the key part: when there is no match, which() returns integer(0), and integer(0):end_rownum is what triggers the "argument of length 0" error.
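Putting the fix together, the whole thing can be wrapped in a small function so the same code runs on both data frames unchanged (the function name trim_after is my own):

```r
# Data frames from the question
df1 <- data.frame(col1 = c("30", "30", "30", "30"),
                  col2 = c(3, 13, 18, 41),
                  col3 = c("heavy", "light", "blue", "black"))
df2 <- data.frame(col1 = c("10", "NONE"),
                  col2 = c("21", "NONE"),
                  col3 = c("blue", "NONE"))

# Drop the first row whose 3rd column matches `phrase`, plus all later rows
trim_after <- function(df, phrase = "light") {
  start_rownum <- which(grepl(phrase, df[, 3]))
  if (length(start_rownum) > 0) {   # no match => integer(0), so we skip
    df <- df[-(start_rownum[1]:nrow(df)), ]
  }
  df
}

df1 <- trim_after(df1)  # only the "heavy" row remains
df2 <- trim_after(df2)  # unchanged: no "light" in col3
```

Using start_rownum[1] also guards against multiple matches: everything from the first match onwards is dropped.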
Related
How can I make the values in columns 1, 2, 3, 4 appear as values in a single column, placed one below the other? The contents are non-numeric. I am unable to install the tidyverse package for some reason. Is there any other way to accomplish this? My dataframe df looks something like this:
df
Person1 Person2 Person3
Doctor  Self    No
Friend  No      Others
Self    Others  Doctor
I want the dataframe to be:
df1
Person
Doctor
Friend
Self
Self
No
Others
No
Others
Doctor
As a general rule, check this excellent answer on how to make a reproducible example in R. It'll help others to provide answers faster.
You can find a way to get your data into a long format using tidyr (assuming your columns are filled with strings, as you mentioned).
> df <- data.frame(col1 = c("some", "strings"), col2 = c("more", "strings"), col3 = c("lotof", "strings"))
> df
col1 col2 col3
1 some more lotof
2 strings strings strings
> library(tidyr)
> pivot_longer(df, c(col1, col2, col3))
# A tibble: 6 x 2
name value
<chr> <fct>
1 col1 some
2 col2 more
3 col3 lotof
4 col1 strings
5 col2 strings
6 col3 strings
Regarding the package installation problems, could you copy the error that pops up and the console output of sessionInfo()?
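If tidyr cannot be installed at all, base R can do the same stacking: unlist() concatenates a data frame's columns one below the other. A minimal sketch using the data from the question:

```r
df <- data.frame(Person1 = c("Doctor", "Friend", "Self"),
                 Person2 = c("Self", "No", "Others"),
                 Person3 = c("No", "Others", "Doctor"))

# unlist() walks the columns left to right, stacking them vertically
out <- data.frame(Person = unlist(df, use.names = FALSE))
```

This loses the `name` column that pivot_longer() produces, but matches the single-column output the question asks for.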
data.table::melt()
or
reshape::melt()
Example
library(reshape)
mdata <- melt(mydata, id = c("id", "time"))
Result
id time variable value
 1    1       x1     5
 1    2       x1     3
 2    1       x1     6
 2    2       x1     2
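The snippet above relies on a mydata object that is never defined. A self-contained equivalent using data.table::melt() (the toy data here is my own, shaped to reproduce the result shown):

```r
library(data.table)

mydata <- data.table(id   = c(1, 1, 2, 2),
                     time = c(1, 2, 1, 2),
                     x1   = c(5, 3, 6, 2))

# id and time stay as identifiers; x1 is stacked into variable/value pairs
mdata <- melt(mydata, id.vars = c("id", "time"))
```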
This question is almost what I wanted to do, except that the output is given as a list of data frames. Let's reproduce the example from the SE question mentioned above.
Let's say I have 2 data frames:
df1
ID col1 col2
x 0 10
y 10 20
z 20 30
df2
ID col1 col2
a 0 10
b 10 20
c 20 30
What I want is a 4th column with an ifelse result. My rationale is:
if col1 >= 20 in any data frame whose name matches the pattern "df", then the new column res = 1, else res = 0.
But I want to create the new column in each data frame with the same name pattern, not put all of the data frames in a list and apply the function there, unless I could "extract" each element of that list back into individual data frames.
Thanks
Per @Frank... if my understanding of what you are looking for is correct, consider using data.table. MWE:
library(data.table)
# Add a `res` column by reference: 1 where col1 >= 20, else 0
addcol <- function(x) x[, res := ifelse(col1 >= 20, 1, 0)]
df1 <- data.table(ID = c("x","y","z"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))
df2 <- data.table(ID = c("x","y","z"), col1 = c(20, 10, 20), col2 = c(10, 20, 30))
# modified df2 so you can see different effects
lapply(list(df1,df2),addcol)
> df1
ID col1 col2 res
1: x 0 10 0
2: y 10 20 0
3: z 20 30 1
> df2
ID col1 col2 res
1: x 20 10 1
2: y 10 20 0
3: z 20 30 1
This works because data.table operates by reference on tables, so inside the function you're actually updating the underlying table, not only the scoped reference to the table.
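For completeness, the same "update every data frame whose name matches a pattern" idea can be sketched in base R with ls(), get() and assign(), staying with ordinary data frames (names and data follow the question):

```r
df1 <- data.frame(ID = c("x", "y", "z"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))
df2 <- data.frame(ID = c("a", "b", "c"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))

# Loop over every object named "df" followed by digits and add `res`
for (nm in ls(pattern = "^df[0-9]+$")) {
  d <- get(nm)
  d$res <- ifelse(d$col1 >= 20, 1, 0)
  assign(nm, d)
}
```

Unlike the data.table version, this copies each data frame, but it avoids any extra packages and keeps each df as a standalone object.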
This question already has answers here:
Data Table - Select Value of Column by Name From Another Column
(3 answers)
Closed 4 years ago.
I have a data.table like this:
col1 col2 col3 new
1 4 55 col1
2 3 44 col2
3 34 35 col2
4 44 87 col3
I want to populate another column matched_value that contains the values from the respective column names given in the new column:
col1 col2 col3 new matched_value
1 4 55 col1 1
2 3 44 col2 3
3 34 35 col2 34
4 44 87 col3 87
E.g., in the first row, the value of new is "col1" so matched_value takes the value from col1, which is 1.
How can I do this efficiently in R on a very large data.table?
An excuse to use the obscure .BY:
DT[, newval := .SD[[.BY[[1]]]], by=new]
col1 col2 col3 new newval
1: 1 4 55 col1 1
2: 2 3 44 col2 3
3: 3 34 35 col2 34
4: 4 44 87 col3 87
How it works: this splits the data into groups based on the strings in new. Within each group, the grouping value is available as the string .BY[[1]]. We use that string to select the corresponding column of .SD (the Subset of Data for the group) via .SD[[.BY[[1]]]].
Alternatives: get(.BY[[1]]) should work just as well in place of .SD[[.BY[[1]]]]. According to a benchmark run by @David, the two ways are equally fast.
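A self-contained version of the one-liner, rebuilding the table from the question:

```r
library(data.table)

DT <- data.table(col1 = c(1, 2, 3, 4),
                 col2 = c(4, 3, 34, 44),
                 col3 = c(55, 44, 35, 87),
                 new  = c("col1", "col2", "col2", "col3"))

# For each group of rows sharing a `new` value, pick that column out of .SD
DT[, newval := .SD[[.BY[[1]]]], by = new]
```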
We can match the 'new' column against the column names of the dataset to get the column index, cbind it with the row index (1:nrow(df1)), and extract the corresponding elements of the dataset by row/column index. The result can be assigned to a new column.
df1$matched_value <- df1[-4][cbind(1:nrow(df1), match(df1$new, colnames(df1)))]
df1
# col1 col2 col3 new matched_value
#1 1 4 55 col1 1
#2 2 3 44 col2 3
#3 3 34 35 col2 34
#4 4 44 87 col3 87
NOTE: If the OP has a data.table, one option is to convert it to a data.frame, or use with=FALSE while subsetting.
setDF(df1) # to convert to a 'data.frame'
Benchmarks
set.seed(45)
df2 <- data.frame(col1 = sample(1:9, 20e6, replace = TRUE),
                  col2 = sample(1:20, 20e6, replace = TRUE),
                  col3 = sample(1:40, 20e6, replace = TRUE),
                  col4 = sample(1:30, 20e6, replace = TRUE),
                  new = sample(paste0('col', 1:4), 20e6, replace = TRUE),
                  stringsAsFactors = FALSE)
system.time(df2$matched_value <- df2[-5][cbind(1:nrow(df2),match(df2$new, colnames(df2) ))])
# user system elapsed
# 2.54 0.37 2.92
I think this is pretty simple. I have a dataframe called df. It has 51 columns, and the rows in each column contain random integers. All I want to do, as a loop, is add all the integers in all the rows of each column and then store the output for each of the columns in a separate list or dataframe.
The df looks like this
Col1 col2 col3 col4
34 12 33 67
22 1 56 66
Etc
The output I want is:
Col1 col2 col3 col4
56 13 89 133
I do want to do this as a loop, as I want to apply what I've learnt here to a more complex script with similar output, and I need to do it quickly. I can't quite master functions as yet...
You can use the built in function colSums for this:
> df <- data.frame(col1 = c(1,2,3), col2 = c(2,3,4))
> colSums(df)
col1 col2
6 9
Another option using a loop:
# Create the result data frame
> res <- data.frame(df[1,], row.names = c('count'))
# Populate the results
> for(n in 1:ncol(df)) { res[colnames(df)[n]] <- sum(df[n]) }
col1 col2
6 9
If you really want to use a loop over a vectorized solution, use apply to loop over columns (second argument equal to 2, 1 is to loop over rows), by mentioning the function you want (here sum):
df = data.frame(col1=1:3,col2=2:4,col3=3:5)
apply(df, 2, sum)
#col1 col2 col3
# 6 9 12
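If the totals really should end up in a separate list built by an explicit loop, as the question asks, a minimal sketch:

```r
df <- data.frame(col1 = 1:3, col2 = 2:4, col3 = 3:5)

sums <- list()
for (nm in colnames(df)) {
  sums[[nm]] <- sum(df[[nm]])  # add up every value in this column
}

unlist(sums)  # named vector: col1 = 6, col2 = 9, col3 = 12
```

Keeping the results in a named list makes it easy to swap sum() for a more complex per-column computation later.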
I have 2 dataframes.
df1-
col1 col2 col3 col4 col5
name1 A 23 x y
name1 A 29 x y
name1 B 17 x y
name1 A 77 x y
df2-
col1 col2 col3
B 17 LL1
Z 193 KK1
A 77 LO9
Y 80 LK2
I want to return those rows from df1 whose col2 and col3 values do not match the col1 and col2 values of any row in df2.
The output should be-
col1 col2 col3 col4 col5
name1 A 23 x y
name1 A 29 x y
The solution I found:
unique.rows <- function(df1, df2) {
  out <- NULL
  for (i in 1:nrow(df1)) {
    found <- FALSE
    for (j in 1:nrow(df2)) {
      if (all(df1[i, 2:3] == df2[j, 1:2])) {
        found <- TRUE
        break
      }
    }
    if (!found) out <- rbind(out, df1[i, ])
  }
  out
}
This solution works fine, but I initially applied it to small dataframes. Now my df1 has about 10k rows and df2 has about 7 million rows, and it has been running for the last 2 days. Could anyone please suggest a faster way to do this?
try
> df1[!paste(df1$col2, df1$col3) %in% paste(df2$col1, df2$col2), ]
col1 col2 col3 col4 col5
1 name1 A 23 x y
2 name1 A 29 x y
What is probably biting you is the line:
if (!found) out <- rbind(out, df1[i,])
You continuously grow a data.frame, which causes the operating system to allocate new memory for the object. I would recommend you preallocate a data.frame with enough room and then assign the right output to the right index. This should speed things up by several orders of magnitude.
In addition, R works vectorized, so often there is no need for an explicit loop. See for example the answer by @ttmaccer. You could also take a look at data.table, which is lightning fast for these kinds of operations.
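The data.table suggestion can be made concrete as an anti-join, which replaces the nested loop with a single keyed lookup (a sketch using the question's data):

```r
library(data.table)

df1 <- data.table(col1 = "name1",
                  col2 = c("A", "A", "B", "A"),
                  col3 = c(23, 29, 17, 77),
                  col4 = "x", col5 = "y")
df2 <- data.table(col1 = c("B", "Z", "A", "Y"),
                  col2 = c(17, 193, 77, 80),
                  col3 = c("LL1", "KK1", "LO9", "LK2"))

# Anti-join: keep df1 rows whose (col2, col3) pair has no match
# in df2's (col1, col2)
res <- df1[!df2, on = .(col2 = col1, col3 = col2)]
```

On 10k rows against 7 million this should finish in seconds rather than days, since the join is indexed instead of comparing every pair of rows.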