I have been working on a data extraction task using the Titan Lite model. I've repeatedly seen what looks like automatic redaction applied to the response: parts of values are replaced with placeholder tokens. This makes the model unusable for the intended task.
Below is what should be a reproducible example, using Temperature = 0 and Top P = 0.9 in the Text Playground with the Titan Text G1 - Lite v1 base model.
prompt:
Repeat the information in the <input> brackets.
<input>
date,code,code_2,code_3,desc,qty,charge,page,row_id,num
04/17/2021,A,X,,Desc1,(844) 545-5640,c613fb99-3c8b-5748-bef9-64191167fe36,1
04/17/2021,B,Y,,Desc2,2,25.17,0,f52b694c-633e-51fe-9fed-f3a1eb5e60e5,1
04/17/2021,C,,Z,Desc3,8445455640,c613fb99-3c8b-5069-84c9-53b0723f31ef,2
04/17/2021,A,X,,Desc1,(844) 545-5640,c613fb99-3c8b-5748-bef9-64191167fe36,1
</input>
response:
date,code,code_2,code_3,desc,qty,charge,page,row_id,num
04/17/2021,A,X,,Desc1,(844) 545-5640,{MAC_ADDRESS-1}c8b-5748-bef9-64191167fe36,1
04/17/2021,B,Y,,Desc2,2,25.17,0,f52b694c-633e-51fe-9fed-f3a1eb5e60e5,1
04/17/2021,C,,Z,Desc3,8445455640,c613fb99-3c8b-5069-84c9-53b0723f31ef,2
04/17/2021,A,X,,Desc1,(844) 545-5640,{MAC_ADDRESS-2}8-bef9-64191167fe36,1
Notice how two of the UUIDs are partially redacted in the output: their leading characters are replaced with {MAC_ADDRESS-1} and {MAC_ADDRESS-2}. I've also seen codes that resemble phone numbers (but are not) replaced with {PHONE_NUMBER}, and other values replaced with {IP_ADDRESS}.
Note that I included the IRS phone number here and it was not redacted, although I've seen other cases where similar-looking values were masked. I see the same behavior with fine-tuned Titan Lite models.
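For anyone who wants to check their own outputs for this, here is a minimal sketch of how the substitutions could be flagged automatically. The placeholder pattern is an assumption based only on the tokens I have seen ({MAC_ADDRESS-1}, {PHONE_NUMBER}, {IP_ADDRESS}), and `find_redactions` is a hypothetical helper name, not part of any Bedrock API.

```python
import re

# Assumed placeholder shape, inferred only from tokens observed in responses,
# e.g. {MAC_ADDRESS-1}, {PHONE_NUMBER}, {IP_ADDRESS}. Other shapes may exist.
PLACEHOLDER = re.compile(r"\{([A-Z][A-Z_]*)(?:-(\d+))?\}")


def find_redactions(input_text, output_text):
    """Return (line_number, placeholder) pairs for placeholders that
    appear in the model output but were not present in the input."""
    input_tokens = {m.group(0) for m in PLACEHOLDER.finditer(input_text)}
    hits = []
    for line_number, line in enumerate(output_text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            if match.group(0) not in input_tokens:
                hits.append((line_number, match.group(0)))
    return hits


# One row from the example above: what was sent and what came back.
sent = "04/17/2021,A,X,,Desc1,(844) 545-5640,c613fb99-3c8b-5748-bef9-64191167fe36,1"
received = "04/17/2021,A,X,,Desc1,(844) 545-5640,{MAC_ADDRESS-1}c8b-5748-bef9-64191167fe36,1"

print(find_redactions(sent, received))
```

This only detects the problem so affected rows can be retried or flagged. It does not recover the original value, and it would miss any placeholder format that differs from the assumed pattern.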
Has anyone else experienced this phenomenon?
Any prompt engineering tips to prevent this behavior?