I tried to use the rbind.fill function from the plyr package to combine two data frames that share a column A, which contains only numbers in the first data frame but also strings in the second. Reproducible example:
library(plyr)

data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666))
rbind.fill(data1,data2)
This produced the output below, with incorrect data in column A, rows 4-6. It did not produce an error message.
        A     b     c
1  107778 33434     6
2 1756756     4     7
3 2324234     5     8
4       2    NA 14562
5       3    NA 45613
6       1    NA    14
I had expected the function to coerce the whole column to character class, or at least insert NA or display a warning. Instead, it inserted numbers that I do not understand (in my actual file, these are two-digit numbers that are not sorted). The documentation does not say that the columns of the data frames to be combined must be of the same type.
How can I get this combination?
       A     b     c
1  11111  4444  5555
2  22222   444 66666
3  33333 44444  7777
4   1234    NA   888
5  ss150    NA   777
6 123456    NA   666
Look at class(data2$A). It's a factor, which is stored internally as an integer vector with a vector of labels (the levels); the mystery numbers in column A are those integer codes. Use stringsAsFactors=FALSE in your data.frame() call, or in read.csv and friends. This forces the variables to be either numeric or character vectors.
library(plyr)

data1 <- data.frame(A=c(11111,22222,33333), b=c(4444,444,44444), c=c(5555,66666,7777))
data2 <- data.frame(A=c(1234,"ss150",123456), c=c(888,777,666), stringsAsFactors=FALSE)
rbind.fill(data1,data2)
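If your real data2 has already been read in and A is already a factor, a minimal sketch of the other route is to coerce the columns to a common type yourself before combining (making both columns character here is my assumption; you may prefer numeric with NA for the non-numeric entries):

# as.character() on a factor returns the labels, not the underlying integer codes
data1$A <- as.character(data1$A)
data2$A <- as.character(data2$A)
rbind.fill(data1, data2)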
I have a data.frame like this
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and put the result in a new column; however, when Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be performed.
Please note that these columns are not rounded in the same way, so many numbers are the same but printed with a different number of decimal places (nsmall).
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes, and the first looks to have more precision. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
      Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002                   NA                    NA
2 191017QNXP008          41293681.00           41293681.00
3 191017CPXP009         111446376.86          111446376.86
4 191017HPXP010          92302936.62           92302936.62
5 191017USXP001                   NA           76693308.46
6 191017USXP002                   NA           76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
test$Sample_Intensity_nRTC,
test$Sample_Intensity_RTC)
head(test)
      Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC      new_col
1 191017QMXP002                   NA                    NA           NA
2 191017QNXP008          41293681.00           41293681.00  41293681.00
3 191017CPXP009         111446376.86          111446376.86 111446376.86
4 191017HPXP010          92302936.62           92302936.62  92302936.62
5 191017USXP001                   NA           76693308.46  76693308.46
6 191017USXP002                   NA           76984658.00  76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
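A minimal sketch of that, assuming the same test data frame read in above:

library(dplyr)

# coalesce() returns the first non-NA value across its arguments, element-wise
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)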
In ui.R
textInput("numbers","New try")
# Example input, 2345 654774 647
In server.R
x = as.numeric(unlist(strsplit(input$numbers, '')))
Output is
2 3 4 5 NA 6 5 4 7 7 4 NA 6 4 7
This converts the input to an atomic vector of single digits (with NA where the spaces were). I want the input converted to a data frame with the whole numbers preserved, for example 2345 654774 647 as three rows of a data frame.
We can use a regex to search for whole numbers in the string instead of splitting it into individual characters. All the matches can then be put into a data frame.
library(stringr)

input <- "2345 654774 647"
Match <- str_match_all(input, "\\d+")
DF <- as.data.frame(Match)
names(DF) <- "Test"
DF
Test
1 2345
2 654774
3 647
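If you would rather not add stringr inside the app, here is a base-R sketch of the same idea (input_string stands in for input$numbers from the server.R snippet above; converting to numeric is an assumption about what you want downstream):

input_string <- "2345 654774 647"   # in server.R this would be input$numbers

# gregexpr finds every run of digits; regmatches extracts the matched substrings
nums <- regmatches(input_string, gregexpr("\\d+", input_string))[[1]]
DF <- data.frame(Test = as.numeric(nums))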
I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, like I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function, which I feel may work, but I am not sure how to build the argument logically. It was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x) {replace(x, x < 0, 0)}))
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3) {replace(x, x < 0, 0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
   individuals num.individuals g4
1:           1             ZYO 64
2:           2             KAO 24
3:           3             MKU 32
4:           4             SAG 42
And here is your g3 table:
   SAG KAO
1:     KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with its fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
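For completeness, the same find-and-replace works in base R on a plain data frame (a sketch assuming the ref and g3 objects defined above; match plays the role of chmatch):

# match() returns the row of ref whose code equals x, and NA where there is
# no match (e.g. the empty cells).
g3_df <- as.data.frame(g3)
as.data.frame(lapply(g3_df, function(x) ref$g4[match(x, ref$num.individuals)]))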
I have a file that looks like so:
date A B
2014-01-01 2 3
2014-01-02 5 NA
2014-01-03 NA NA
2014-01-04 7 11
If I use newdata <- na.omit(data), where data is the above table loaded into R, I get only two data points. I understand that, since na.omit removes every row that contains an NA. What I want is to filter A and B separately, so that I get three data points for A and only two for B. Clearly, my main data set is much larger than this and the numbers are different, but neither of those should matter.
How can I achieve that?
Use is.na() on the relevant vector of data you wish to check and index using the negated result. For example:
R> data[!is.na(data$A), ]
date A B
1 2014-01-01 2 3
2 2014-01-02 5 NA
4 2014-01-04 7 11
R> data[!is.na(data$B), ]
date A B
1 2014-01-01 2 3
4 2014-01-04 7 11
is.na() returns TRUE for every element that is NA and FALSE otherwise. To index the rows of the data frame we can use this logical vector, but we want its converse, so we use ! to negate it (TRUE becomes FALSE and vice versa).
You can restrict which columns you return by adding an index for the columns after the , in [ , ], e.g.
R> data[!is.na(data$A), 1:2]
date A
1 2014-01-01 2
2 2014-01-02 5
4 2014-01-04 7
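The same filtering can also be written with subset(), which some find easier to read (a side note, not part of the original answer; column names as in the example data):

# Rows where A is not NA, keeping only the date and A columns
subset(data, !is.na(A), select = c(date, A))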
Every column in a data frame must have the same number of elements; that is why NAs come in handy in the first place...
What you can do is
df.a <- df[!is.na(df$A), -3]
df.b <- df[!is.na(df$B), -2]
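And if you ever want only the rows where neither A nor B is missing, complete.cases() is the base tool for that (an aside beyond what was asked):

# Keep only rows where both A and B are non-NA
df[complete.cases(df[, c("A", "B")]), ]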
In Python (pandas), we can use the subset argument to specify the column(s), and inplace=True to make the change in the DataFrame itself:
rounds2.dropna(subset=['company_permalink'],inplace=True)
Given the basic tools I know now (which, order, if, %in%, etc.), I frequently run into one problem I call "the uniqueness problem".
The problem basically looks like this...
I have a matrix A that I want to fill out from another raw matrix, B.
A:
[upc] [day1] [day2] ... day52
[1] 123 NA NA NA
[2] 456 NA NA NA
[3] 789 NA NA NA
B is mega huge row wise, so looping is out of the question.
[upc] [quantity] [day]
[1] 123 11 1
[2] 123 2 1
[3] 789 5 1
[4] 456 10 1
[5] 789 6 1
I want to fill up day1 for each UPC in matrix A with the quantities in matrix B. The problem is that there are multiple instances of each UPC in B, and I can't loop over them to get the total quantity to put next to each upc.
So what I WANT is this (which would eventually be filled out completely, i.e. days 2-52, by looping over the other days, which is a small and manageable loop):
A:
[upc] [day1] [day2] ... day52
[1] 123 13 NA NA
[2] 456 10 NA NA
[3] 789 11 NA NA
Do you know any functions that can accomplish this without looping?
If you convert your original matrices to data.frames, you can employ aggregate, merge, and reshape to get there:
Make some data including multiple days for the added id of 999:
A <- data.frame(upc=c(123,456,789,999))
B <- data.frame(
upc=c(123,123,789,456,789,999,999,999),
quantity=c(11,2,5,10,6,10,3,3),
day=c(1,1,1,1,1,1,2,2)
)
Aggregate the quantities by id and day, then merge and reshape:
mrgd <- merge(A,aggregate(quantity ~ upc + day ,data=B, sum),by="upc")
final <- reshape(mrgd,idvar="upc",timevar="day",direction="wide",sep="")
names(final) <- gsub("quantity","day",names(final))
Which gives:
final
# upc day1 day2
#1 123 13 NA
#2 456 10 NA
#3 789 11 NA
#4 999 10 6
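For comparison, here is a sketch of the same aggregate-and-widen logic with dplyr and tidyr (not from the original answer; it assumes the A and B data frames created above):

library(dplyr)
library(tidyr)

A %>%
  left_join(
    B %>%
      group_by(upc, day) %>%
      summarise(quantity = sum(quantity), .groups = "drop"),
    by = "upc"
  ) %>%
  # one column per day, named day1, day2, ...
  pivot_wider(names_from = day, values_from = quantity, names_prefix = "day")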
You can create a matrix A using the tapply function:
> B <- data.frame(
+ upc=c(123,123,789,456,789,999,999,999),
+ quantity=c(11,2,5,10,6,10,3,3),
+ day=c(1,1,1,1,1,1,2,2)
+ )
> tapply( B$quantity, B[,c('upc','day')], FUN=sum )
day
upc 1 2
123 13 NA
456 10 NA
789 11 NA
999 10 6
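If you then want the result shaped like A, with upc as an ordinary column rather than as row names, one way to convert it (a sketch continuing from the tapply call above) is:

tab <- tapply(B$quantity, B[, c("upc", "day")], FUN = sum)

# Turn the upc dimnames back into a column and label the day columns
A_filled <- data.frame(upc = as.numeric(rownames(tab)), tab)
names(A_filled)[-1] <- paste0("day", colnames(tab))
A_filled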
If the B matrix is really huge, then you might consider saving it as an ff object (ff package) and then using ffrowapply to process it in chunks.