Invalid values in an array on executing for loop in R

I am new to R and stuck on a very basic problem. I get NA values in the count array after executing the following code:
i=1
j=2
l=1
count=0
while(j<length(positions)){
  a=positions[i]
  b=positions[j]
  for(k in a:b){
    if(y$feature[k]==x$feature[l]){
      count[l]=count[l]+1
    }
  }
  i=i+2
  j=j+2
  l=l+1
}
For reference, y and x data frames are as follows:
y data frame
positions id feature
1 1 45128
2 1 28901
3 1 48902
. .
. .
. .
. .
2344 1 45579
2345 2 37689
2346 2 45547
. .
. .
5677 2 12339
5678 3 98034
5679
.
.
x dataframe :
id feature
1 28901
2 23498
3 98906
. .
. .
. .
I have stored in the positions array the row at which each new id starts and the row at which it ends.
positions is an array consisting of [1,2344,2345,5677,5678,7390,7391,...]. I step through it in pairs, with i being 1,3,5... and j being 2,4,6..., so that a:b covers the rows of one id. If y$feature and x$feature match, I increment count[l].
So the first feature of x is compared with all features in y with id=1, the second feature in x is compared with all features in y with id=2, and so on. When they match, count[l] is incremented. i and j are each incremented by 2 so that they point at the start and end positions of the next id. But I only get a valid answer for count[1]; all the other values are NA.
Please explain why this happens and suggest a valid way to do this using loops.

It's because you are trying to add 1 to a nonexistent value, count[l]. You start out with count=0, so count has length one. There is no count[2], so a reference to count[2] returns NA. Then (when l = 2 in your loop), NA + 1 returns NA.
If you initialize count<-rep(0,length(positions)) this particular problem will go away.
Meanwhile, you can vectorize your operations quite a lot. I believe you can replace the k-loop with
count[l] <- sum(y$feature[a:b]==x$feature[l])
for one example.

Related

How to know if there is a different element in one array in Scilab?

My goal is to check if there are misplaced objects in one array.
for example the array is
2.
2.
2.
2.
2.
1.
3.
1.
3.
3.
3.
1.
3.
1.
1.
1.
1.
I want to know whether the elements within each group (the first 5, elements 6 to 13, and elements 14 to 17) are all the same.
The purpose of this is to identify the misplaced elements in a clustering solution.
I have tried for the first 5 elements
ISet=5
IVer=7
IVir=5
for i=1:ISet
  if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
    numMisp=numMisp+1
    mprintf("Set misp: %i",numMisp)
  end
end
For the next 6 to 13 elements
for i=ISet+1:IVer+ISet-1
  if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
    mprintf("%i %i Ver misp: %i\n",FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2),i)
    numMisp=numMisp+1
  end
end
For the next 14 to 17 elements
for i=IVer+ISet:IVer+IVir-1
  if(isequal(FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2))==%f)
    mprintf("%i %i Ver misp: %i\n",FIRSTMIN(i,1,2),FIRSTMIN(i+1,1,2),i)
    numMisp=numMisp+1
    mprintf("Vir misp: %i",i)
  end
end
You can use unique for that purpose. For example, the following test checks whether the first five elements are all the same:
x=[2 2 2 2 2 1 3 1 3 3 3 1 3 1 1 1 1];
if length(unique(x(1:5))) == 1
//
end
You can do the same for the other clusters by replacing 1:5 with 6:13 and then 14:17.

How to perform pandas drop_duplicates based on index column

I am banging my head against the wall trying to drop duplicates from a time series, based on the value of a datetime index.
My function is the following:
def csv_import_merge_T(f):
    dfsT = [pd.read_csv(fp, index_col=[0], parse_dates=[0], dayfirst=True, names=['datetime','temp','rh'], header=0) for fp in files]
    dfT = pd.concat(dfsT)
    #print dfT.head(); print dfT.index; print dfT.dtypes
    dfT.drop_duplicates(subset=index, inplace=True)
    dfT.resample('H').bfill()
    return dfT
which is called by:
inputcsvT = ['./input_csv/A08_KI_T*.csv']
for csvnameT in inputcsvT:
    files = glob.glob(csvnameT)
    print ('___'); print (files)
    t = csv_import_merge_T(files)
    print csvT
I receive the error
NameError: global name 'index' is not defined
what is wrong?
UPDATE:
The issue appears to arise when the CSV input files (which are to be concatenated) overlap.
inputcsvT = ['./input_csv/A08_KI_T*.csv'] gets files
A08_KI_T5
28/05/2015 17:00,22.973,24.021
...
08/10/2015 13:30,24.368,45.974
A08_KI_T6
08/10/2015 14:00,24.779,41.526
...
10/02/2016 17:00,22.326,41.83
and it runs correctly, whereas:
inputcsvT = ['./input_csv/A08_LR_T*.csv'] gathers
A08_LR_T5
28/05/2015 17:00,22.493,25.62
...
08/10/2015 13:30,24.296,44.596
A08_LR_T6
28/05/2015 17:00,22.493,25.62
...
10/02/2016 17:15,21.991,38.45
which leads to an error.
IIUC you can call reset_index and then drop_duplicates and then set_index again:
In [304]:
df = pd.DataFrame(data=np.random.randn(5,3), index=list('aabcd'))
df
Out[304]:
0 1 2
a 0.918546 -0.621496 -0.210479
a -1.154838 -2.282168 -0.060182
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
In [308]:
df.reset_index().drop_duplicates('index').set_index('index')
Out[308]:
0 1 2
index
a 0.918546 -0.621496 -0.210479
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
EDIT
Actually, a simpler method is to call duplicated on the index and invert it:
In [309]:
df[~df.index.duplicated()]
Out[309]:
0 1 2
a 0.918546 -0.621496 -0.210479
b 2.512519 -0.771701 -0.328421
c -0.583990 -0.460282 1.294791
d -1.018002 0.826218 0.110252
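Applied to the time-series case in the question, a minimal sketch might look like the following. The in-memory CSV text and the function name merge_without_duplicate_timestamps are made up for illustration (the real files would come from glob), and the timestamps are assumed to be day/month/year as in the sample rows.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two overlapping CSV exports; the first line of
# each is a header row that gets replaced by our own column names.
CSV_A = """datetime,temp,rh
28/05/2015 17:00,22.493,25.62
28/05/2015 17:30,22.510,25.70
"""
CSV_B = """datetime,temp,rh
28/05/2015 17:00,22.493,25.62
28/05/2015 18:00,22.600,25.90
"""


def load(source):
    """Read one CSV and index it by its parsed timestamp column."""
    df = pd.read_csv(source, names=["datetime", "temp", "rh"], header=0)
    # Assumption: timestamps look like "28/05/2015 17:00" (day first).
    df["datetime"] = pd.to_datetime(df["datetime"], format="%d/%m/%Y %H:%M")
    return df.set_index("datetime")


def merge_without_duplicate_timestamps(sources):
    """Concatenate the frames and keep the first row for each timestamp."""
    merged = pd.concat([load(src) for src in sources])
    # index.duplicated() marks every repeat of an index value after the first;
    # inverting the mask keeps exactly one row per timestamp.
    merged = merged[~merged.index.duplicated(keep="first")]
    return merged.sort_index()


merged = merge_without_duplicate_timestamps(
    [io.StringIO(CSV_A), io.StringIO(CSV_B)]
)
print(merged)
```

Note that the NameError in the question comes from subset=index, where index is an undefined Python name, and that resample(...).bfill() returns a new DataFrame rather than modifying dfT in place, so its result has to be assigned if it is wanted.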

Using condition in columns of data frame to generate a vector in R

I have the following array:
Year Month Day Hour
1 1 1 1 0
2 1 1 1 3
...
etc
I wrote a function which I then tried to vectorize by using apply in order to run the calculation on a row-by-row basis, but it doesn't work because of the booleans:
day_in_season<-function(tarr){
  #first month in season
  if((tarr$month==12) || (tarr$month==3) ||(tarr$month==6) || (tarr$month==9)){
    d=tarr$day
  #second month in season
  }else if ((tarr$month==1) || (tarr$month==4)){
    d=31+tarr$day
  }else if((tarr$month==7) || (tarr$month==10)){
    d=30+tarr$day
  #third month in season
  }else if((tarr$month==2)){
    d=62+tarr$day
  }else{
    d=61+tarr$day
  }
  h=tarr$hour/24
  d=d+h
  return(d)
}
I tried
apply(tdjf,1,day_in_season)
but it raised this exception:
Error in tarr$month : $ operator is invalid for atomic vectors
(I already knew about this potential pitfall, but that's why I wanted to use apply in the first place!)
The only way I can currently get it to work is if I do this:
days<-c()
for (x in 1:nrow(tdjf)){
d<-day_in_season(tdjf[x,])
days=append(days,d)
}
If there were only a few values, I'd throw up my hands and just use the for loop, efficiency be damned, but I have over 15,000 rows and that's just one dataset. I know that there has to be a way to make it work.
To vectorize your code, use ifelse() and | instead of ||:
ifelse(
  (tarr$month==12) | (tarr$month==3) |(tarr$month==6) | (tarr$month==9),
  tarr$day,
  ifelse((tarr$month==1) | (tarr$month==4),
    31+tarr$day,
    ifelse((tarr$month==7) | (tarr$month==10),
      30+tarr$day,
      ifelse(tarr$month==2,
        62+tarr$day,
        61+tarr$day)
    )
  )
)+tarr$hour/24
You might be surprised at how quickly a well-constructed for loop can run. If designed well, it has about the same efficiency as an apply statement.
The proper for loop in your case is
tdjf$days <- vector ("numeric", nrow (tdjf))
for (x in seq_along (tdjf$days)){
  tdjf$days [x] <- day_in_season(tdjf[x,])
}
If you really want to go the apply route, I would recommend rewriting your function to take three arguments -- month, day, and hour -- and passing those three columns into mapply.

Stata counting substring

My table looks like this:
ID AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
ID AQ_ATC amountATC
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by which I mean that if there is "C05" there won't be another partial code "C05A1".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so.
My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
  local i = 0
  quietly foreach x of local ATCvals {
    if strpos(`y', `"`x'"') == 1{
      local i = `i'+1
      replace amountATC = `i'
    }
  }
}
My idea was to use a counter "i" and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write "i" into amountATC, and after I have iterated over the entire table for my AQ_ATC, I will have an "i" value equal to the number of occurrences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC.
At least that's how I intended it to work; what it did in the end was set all amountATC values to 1.
I also looked into different egen functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.
It seems as if you come from another language and insist on using loops when they are not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
Edit
With your edit, the same principles apply. Below is code that adjusts to your change and is slightly shorter (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
<snip>
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.
The by: prefix is fundamental and very helpful, and I noticed you can use it after defining appropriate groups. I created the groups taking advantage of the fact that all short (sub)strings have a missing(ID).
Then (by the groups just defined) you only want to add up one value (observation) per amountATC. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does.
Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Edit2
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)

sql output as a list in ksh

I have a script where the sql output of the function is multiple rows (one column), and I'm trying to loop through those rows with a for loop, but I can't seem to get it to work...
rslt=sqlquery {}
echo $rslt
1
2
3
4
for i in $rslt
do
echo "lvl$i"
done
but for the loop...I keep getting this back four times
lvl1
2
3
4
whereas I want to get this back...
lvl1
lvl2
lvl3
lvl4
how do I get that?
To get the needed result in your script, you need to put $rslt in double quotes ("). This ensures that you do not lose the newlines (\n) from your result, which you are expecting to have in the loop.
for i in "$rslt"
do
echo "lvl$i"
done
To loop over the values in a ksh array, you need to use the ${array[@]} syntax:
$ set -A rslt 1 2 3 4
$ for i in ${rslt[@]}
> do
> echo "lvl$i"
> done
lvl1
lvl2
lvl3
lvl4
